Abstract

Targeting sparse learning, we construct Banach spaces $\mathcal{B}$ of functions on an input
space $X$ with the following properties: (1) $\mathcal{B}$ possesses an $\ell^1$ norm in the sense that $\mathcal{B}$ is
isometrically isomorphic to the Banach space of integrable functions on $X$ with respect to
the counting measure; (2) point evaluations are continuous linear functionals on $\mathcal{B}$ and are
representable through a bilinear form with a kernel function; and (3) regularized learning
schemes on $\mathcal{B}$ satisfy the linear representer theorem. Examples of kernel functions admissible
for the construction of such spaces are given.

Keywords: reproducing kernel Banach spaces, sparse learning, lasso, basis pursuit,
regularization, the representer theorem, the Brownian bridge kernel, the exponential kernel.



1 Introduction 

It is now widely known that minimizing a loss function regularized by the $\ell^1$ norm yields sparsity
in the resulting minimizer. Sparsity is essential for extracting relatively low dimensional
features from sample data that usually live in a high dimensional space. When the square loss
function is used in regression, the method is known as the lasso in statistics [26]. Recently, the
methodology has been applied to compressive sensing, where it is referred to as basis pursuit
[|, |]. The purpose of this paper is to establish an appropriate foundation for developing $\ell^1$
regularization for machine learning with reproducing kernels.

Past research on learning with kernels 0, |^, |22|, |2^, 24, 27] has mainly been built upon the
theory of reproducing kernel Hilbert spaces (RKHS) Q. There are many reasons that account
for the success of this choice. An RKHS is by definition a Hilbert space of functions
on which point evaluations are continuous linear functionals. Sample data available for learning
are usually modeled by point evaluations of the unknown target function. Therefore, RKHS form a
class of function spaces where sampling is stable, a desirable feature in applications. By the Riesz
representation theorem, continuous linear functionals on a Hilbert space are representable by the
inner product on the space. This gives rise to the representation of point evaluation functionals
on an RKHS by its associated reproducing kernel and leads to the celebrated representer theorem
[p!^ ] in machine learning. This theorem states that the original minimization problem in a
typically infinite dimensional RKHS can be converted into a problem of determining finitely
many coefficients in a linear combination of the kernel function with one argument evaluated at
the data sites.

For this representer theorem, the nonzero coefficients to be found are generally as many as
the sampling points. For the sake of economy, it is hence desirable to regularize the class of
candidate functions by some $\ell^1$ norm to force most of the coefficients to be zero. An attempt in
this direction is the linear programming approach to coefficient based regularization for machine
learning [ p2| . The method, however, lacks a general mathematical foundation like that of the RKHS. In
particular, it is unknown whether the algorithm results, via some representer theorem, from a
minimization on an infinite dimensional Banach space. A consequence is that the hypothesis
error in the learning rate estimate will not go away automatically as in the RKHS case [pO| .

We aim at combining the reproducing kernel methods and the $\ell^1$ regularization technique.
Specifically, we desire to construct function spaces with the following properties:

— point evaluation functionals on the space are continuous and can be represented by some 
kernel function; 

— the space possesses an $\ell^1$ norm;

— a linear representer theorem holds for regularized learning schemes on the space. 

There are three ways of representing continuous point evaluation functionals in a function space:
by an inner product, by a semi-inner product [11, 15], or by a bilinear form on the tensor product
of the space and its dual space. Since the space we construct is expected to have an $\ell^1$ norm, it
cannot have an inner product. Semi-inner products are a natural substitute for inner products in
Banach spaces. A notion of reproducing kernel Banach spaces (RKBS) was established in [32]
via the semi-inner product. The spaces considered there are uniformly convex and uniformly
Fréchet differentiable to ensure that continuous linear functionals have a unique representation
by the semi-inner product. An infinite dimensional Banach space with the $\ell^1$ norm is
non-reflexive. As a consequence, there is no guarantee [13] that the semi-inner product is able to
represent all continuous point evaluation functionals in such a space. For these reasons, we shall
pursue the third approach in this study, that is, to represent the point evaluation functionals by
a bilinear form. We briefly introduce the construction and main results of the paper below.

Let $X$ be a prescribed set that we call the input space. The construction starts directly with
a complex-valued function $K$ on $X \times X$, which is not necessarily Hermitian. For the constructed
space to have the three desirable properties described above, $K$ needs to be an admissible kernel.
To introduce this class of functions crucial to our construction, we denote for any set $\Omega$ by
$\ell^1(\Omega)$ the Banach space of functions on $\Omega$ that are integrable with respect to the counting
measure on $\Omega$. In other words,

$$\ell^1(\Omega) := \Big\{ c = (c_t \in \mathbb{C} : t \in \Omega) : \|c\|_{\ell^1(\Omega)} := \sum_{t \in \Omega} |c_t| < +\infty \Big\}.$$

Note that $\Omega$ might be uncountable but for every $c \in \ell^1(\Omega)$, $\operatorname{supp} c := \{t \in \Omega : c_t \ne 0\}$ must be
countable. Finally, we define the set $\mathbb{N}_n := \{1, 2, \dots, n\}$ for all $n \in \mathbb{N}$.

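Definition 1.1. A function $K$ on $X \times X$ is called an admissible kernel for the construction of
RKBS on $X$ with the $\ell^1$ norm if the following requirements are satisfied:

(A1) for all sequences $\mathbf{x} = \{x_j : j \in \mathbb{N}_n\} \subseteq X$ of pairwise distinct sampling points, the matrix

$$K[\mathbf{x}] := [K(x_k, x_j) : j, k \in \mathbb{N}_n] \in \mathbb{C}^{n \times n} \tag{1.1}$$

is nonsingular,

(A2) $K$ is bounded, namely, $|K(s,t)| \le M$ for some positive constant $M$ and all $s, t \in X$,

(A3) for all pairwise distinct $x_j \in X$, $j \in \mathbb{N}$, and $c \in \ell^1(\mathbb{N})$, $\sum_{j=1}^{\infty} c_j K(x_j, x) = 0$ for all $x \in X$
implies $c = 0$, and

(A4) for all pairwise distinct $x_1, x_2, \dots, x_{n+1} \in X$,

$$\big\| (K[\mathbf{x}])^{-1} K_{\mathbf{x}}(x_{n+1}) \big\|_{\ell^1(\mathbb{N}_n)} \le 1, \tag{1.2}$$

where $K_{\mathbf{x}}(x) := (K(x, x_j) : j \in \mathbb{N}_n)^T \in \mathbb{C}^n$.

The following theorem will be proved in the next three sections.

Theorem 1.2. If $K$ is an admissible kernel on $X \times X$ then

$$\mathcal{B} := \Big\{ \sum_{t \in \operatorname{supp} c} c_t K(t, \cdot) : c \in \ell^1(X) \Big\} \quad \text{with the norm} \quad \Big\| \sum_{t \in \operatorname{supp} c} c_t K(t, \cdot) \Big\|_{\mathcal{B}} := \|c\|_{\ell^1(X)} \tag{1.3}$$

and $\mathcal{B}^\sharp$, the completion of the vector space of functions $\sum_{j=1}^{n} c_j K(\cdot, x_j)$, $x_j \in X$, under the
supremum norm

$$\Big\| \sum_{j=1}^{n} c_j K(\cdot, x_j) \Big\|_{\mathcal{B}^\sharp} := \sup_{x \in X} \Big| \sum_{j=1}^{n} c_j K(x, x_j) \Big|,$$

are both Banach spaces of functions on $X$ where point evaluations are continuous linear
functionals. In addition, the bilinear form

$$\Big( \sum_{j=1}^{n} a_j K(s_j, \cdot), \sum_{k=1}^{m} b_k K(\cdot, t_k) \Big)_K := \sum_{j=1}^{n} \sum_{k=1}^{m} a_j b_k K(s_j, t_k), \quad s_j, t_k \in X \tag{1.4}$$

can be extended to $\mathcal{B} \times \mathcal{B}^\sharp$ such that

$$|(f, g)_K| \le \|f\|_{\mathcal{B}} \|g\|_{\mathcal{B}^\sharp} \quad \text{for all } f \in \mathcal{B},\ g \in \mathcal{B}^\sharp$$

and

$$(f, K(\cdot, x))_K = f(x), \quad (K(x, \cdot), g)_K = g(x) \quad \text{for all } x \in X,\ f \in \mathcal{B},\ g \in \mathcal{B}^\sharp.$$

Furthermore, for every regularized learning scheme of the form

$$\inf_{f \in \mathcal{B}} V(f(x_1), f(x_2), \dots, f(x_n)) + \mu \phi(\|f\|_{\mathcal{B}}),$$

where $\mu$ is a positive regularization parameter, $V$ and $\phi$ are nonnegative continuous functions
with $\lim_{t \to \infty} \phi(t) = +\infty$, there exists a minimizer, $f_0$, of the form

$$f_0(x) = \sum_{j=1}^{n} c_j K(x_j, x), \quad x \in X$$

for some coefficients $c_j \in \mathbb{C}$, $j \in \mathbb{N}_n$.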
Conversely, for the constructed spaces $\mathcal{B}$ and $\mathcal{B}^\sharp$ to enjoy those desirable properties, $K$ must
be an admissible kernel on $X \times X$.

The organization of the paper is as follows. We first present a general construction of Banach
spaces of functions with a reproducing kernel in the next section. In Section 3, we specialize the
construction to the building of RKBS with the $\ell^1$ norm as described in Theorem 1.2. In Section
4, we study the conditions on the reproducing kernel so that regularized learning schemes on
the constructed spaces satisfy the linear representer theorem. We then show
that the Brownian bridge kernel and the exponential kernel are admissible kernels. In the final
section, condition (A4), the most stringent condition in Definition 1.1, is relaxed, which leads to
a modified version of Theorem 1.2.
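Conditions (A1), (A2) and (A4) of Definition 1.1 involve only finitely many points, so they can be probed numerically on a given sample. The following is a minimal sketch under stated assumptions: it uses the one-dimensional exponential kernel in the form $K(s,t) = e^{-|s-t|}$ (this concrete formula, the sample points and the helper names are illustrative choices, not taken from the text), and a finite check of this kind can only suggest, never prove, admissibility.

```python
import numpy as np

def exp_kernel(s, t):
    """Exponential kernel on the real line (assumed concrete form e^{-|s-t|})."""
    return np.exp(-np.abs(s - t))

def kernel_matrix(kernel, x):
    """K[x] with entry K(x_k, x_j) in row j, column k, following (1.1)."""
    x = np.asarray(x, dtype=float)
    return kernel(x[None, :], x[:, None])

def a4_norm(kernel, x, x_new):
    """l^1 norm of K[x]^{-1} K_x(x_new), the quantity bounded by 1 in (A4)."""
    x = np.asarray(x, dtype=float)
    k_vec = kernel(x_new, x)  # K_x(x_new) = (K(x_new, x_j) : j)
    weights = np.linalg.solve(kernel_matrix(kernel, x), k_vec)
    return np.abs(weights).sum()

# Hypothetical sampling points and extra points, all pairwise distinct.
x = np.array([0.0, 0.7, 1.5, 2.2, 3.1])
candidates = [-0.5, 0.3, 1.0, 1.9, 2.6, 4.0]

K = kernel_matrix(exp_kernel, x)
is_nonsingular = np.linalg.matrix_rank(K) == len(x)   # (A1) on this sample
is_bounded = np.max(np.abs(K)) <= 1.0                 # (A2) with M = 1 on this sample
worst_a4 = max(a4_norm(exp_kernel, x, t) for t in candidates)
print(is_nonsingular, is_bounded, worst_a4)
```

Replacing `exp_kernel` by another kernel function with the same calling convention reuses the same checks.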



2 A General Construction 

To ensure that there exists a reproducing kernel, we shall start the construction of the Banach
space based on such a function. Let $X$ be an input space and let $K$ be a function on $X \times X$.
Introduce the vector space

$$\mathcal{B}_0 := \operatorname{span}\{K(x, \cdot) : x \in X\}.$$

Note that unlike reproducing kernels for Hilbert spaces, this $K$ is not necessarily symmetric in
its arguments or positive definite. Suppose that a norm $\|\cdot\|_{\mathcal{B}_0}$ is imposed on $\mathcal{B}_0$ such that point
evaluation functionals are continuous on $\mathcal{B}_0$. That is, for any $x \in X$, there exists a positive
constant $M_x$ such that

$$|\delta_x(f)| = |f(x)| \le M_x \|f\|_{\mathcal{B}_0} \quad \text{for all } f \in \mathcal{B}_0. \tag{2.1}$$

The function $K$ and the norm on $\mathcal{B}_0$ will be explicitly given in a specific construction.



In [31, 33, 32], a vector space $\mathcal{B}$ is called an RKBS on $X$ if it is a uniformly convex and
uniformly Fréchet differentiable Banach space of functions on $X$ and point evaluation functionals
are continuous on $\mathcal{B}$. The uniform convexity and uniform Fréchet differentiability were imposed
there to ensure the existence of a reproducing kernel for representing the point evaluation
functionals. By the results to be established in the current paper, these stronger conditions are not
necessary. To accommodate the search for alternatives, we introduce the following definitions.

Definition 2.1. The space $\mathcal{B}$ is called a Banach space of functions if the point evaluation
functionals are consistent with the norm on $\mathcal{B}$ in the sense that for all $f \in \mathcal{B}$, $\|f\|_{\mathcal{B}} = 0$ if
and only if $f$ vanishes everywhere on $X$. A Banach space $\mathcal{B}$ of functions on $X$ is said to be a
pre-RKBS on $X$ if point evaluations are continuous linear functionals on $\mathcal{B}$.

We plan to complete $\mathcal{B}_0$ by the norm $\|\cdot\|_{\mathcal{B}_0}$ to obtain a pre-RKBS $\mathcal{B}$. Two things need
to be checked for the approach to succeed. An abstract completion of $\mathcal{B}_0$ might not consist of
functions, or might not have bounded point evaluation functionals. We shall present a Banach
completion process that yields a space of functions. Let $\{f_n : n \in \mathbb{N}\}$ be a Cauchy sequence
in $\mathcal{B}_0$. Since point evaluation functionals are continuous on $\mathcal{B}_0$, for any $x \in X$, the sequence
$\{f_n(x) : n \in \mathbb{N}\}$ converges in $\mathbb{C}$. We denote the limit by $f(x)$, which defines a function on
$X$. One sees that two equivalent Cauchy sequences in $\mathcal{B}_0$ give the same function. We let $\mathcal{B}$ be
composed of all such limit functions with the norm $\|f\|_{\mathcal{B}} := \lim_{n \to \infty} \|f_n\|_{\mathcal{B}_0}$.

To investigate conditions for $\mathcal{B}$ to be a pre-RKBS, we need to invoke the following assumption.

Definition 2.2. A normed vector space $V$ of functions on $X$ satisfies the Norm Consistency
Property if for every Cauchy sequence $\{f_n : n \in \mathbb{N}\}$ in $V$, $\lim_{n \to \infty} f_n(x) = 0$ for all $x \in X$ implies
$\lim_{n \to \infty} \|f_n\|_V = 0$.

Proposition 2.3. The norm $\|\cdot\|_{\mathcal{B}}$ is well-defined and makes $\mathcal{B}$ a pre-RKBS on $X$ if and only
if $\mathcal{B}_0$ satisfies the Norm Consistency Property.

Proof. We first show the necessity. If $\mathcal{B}$ is a Banach space then $\|\cdot\|_{\mathcal{B}}$ is a well-defined norm.
The validity of the Norm Consistency Property follows directly from $\|0\|_{\mathcal{B}} = 0$.

We next prove the sufficiency. Suppose that the Norm Consistency Property holds for $\mathcal{B}_0$.
We first show that $\|\cdot\|_{\mathcal{B}}$ is a well-defined norm. Suppose that $\{f_n : n \in \mathbb{N}\}$ and $\{g_n : n \in \mathbb{N}\}$
are both Cauchy sequences in $\mathcal{B}_0$ such that $\lim_{n \to \infty} f_n(x) = \lim_{n \to \infty} g_n(x)$ for all $x \in X$. We need
to show that $\lim_{n \to \infty} \|f_n\|_{\mathcal{B}_0} = \lim_{n \to \infty} \|g_n\|_{\mathcal{B}_0}$. Clearly, $f_n - g_n$ forms a Cauchy sequence in $\mathcal{B}_0$.
Since $\lim_{n \to \infty} (f_n - g_n)(x) = 0$ for all $x \in X$, it follows from the Norm Consistency Property that
$\lim_{n \to \infty} \|f_n - g_n\|_{\mathcal{B}_0} = 0$, which implies $\lim_{n \to \infty} \|f_n\|_{\mathcal{B}_0} = \lim_{n \to \infty} \|g_n\|_{\mathcal{B}_0}$. Therefore, $\|\cdot\|_{\mathcal{B}}$ is well-defined.
As a result, $\mathcal{B}$ is isometrically isomorphic to the abstract Banach space that is the completion of
$\mathcal{B}_0$. It implies that $\mathcal{B}$ is a Banach space and $\mathcal{B}_0$ is dense in $\mathcal{B}$. Moreover, it follows immediately
from the Norm Consistency Property that $\mathcal{B}$ is a Banach space of functions. It remains to show
that the point evaluation functional $\delta_x$ is continuous on $\mathcal{B}$ for all $x \in X$. Let $x \in X$ and $f \in \mathcal{B}$.
By definition, there exists a Cauchy sequence $\{f_n : n \in \mathbb{N}\}$ in $\mathcal{B}_0$ such that

$$f(x) = \lim_{n \to \infty} f_n(x) \ \text{ for all } x \in X, \quad \text{and} \quad \|f\|_{\mathcal{B}} = \lim_{n \to \infty} \|f_n\|_{\mathcal{B}_0}.$$

Since $\delta_x$ is continuous on $\mathcal{B}_0$, there exists a positive constant $M_x$ such that

$$|f_n(x)| \le M_x \|f_n\|_{\mathcal{B}_0} \quad \text{for all } n \in \mathbb{N}.$$

Taking the limits on both sides, we have $|f(x)| \le M_x \|f\|_{\mathcal{B}}$. The proof is complete. □

In the rest of this section, we assume the Norm Consistency Property for $\mathcal{B}_0$ and aim at
deriving a reproducing kernel for $\mathcal{B}$. To this end, we set

$$\mathcal{B}_0^\sharp := \operatorname{span}\{K(\cdot, x) : x \in X\}$$

and define a bilinear form on $\mathcal{B}_0 \times \mathcal{B}_0^\sharp$ by (1.4). It is straightforward to observe that

$$(f, K(\cdot, x))_K = f(x), \quad (K(x, \cdot), g)_K = g(x) \quad \text{for all } f \in \mathcal{B}_0,\ g \in \mathcal{B}_0^\sharp \text{ and } x \in X.$$

It means (1.4) is well defined and that $K$ is able to reproduce the point evaluations of functions
on $\mathcal{B}_0$ via this bilinear form. We need to extend this property to the whole space $\mathcal{B}$ in order to
claim that it is a reproducing kernel for $\mathcal{B}$. For this purpose, we define another norm

$$\|g\|_{\mathcal{B}_0^\sharp} := \sup_{f \in \mathcal{B}_0,\ f \ne 0} \frac{|(f, g)_K|}{\|f\|_{\mathcal{B}_0}}, \quad g \in \mathcal{B}_0^\sharp. \tag{2.2}$$

The next result indicates that the above norm is well-defined.
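On finite kernel expansions the bilinear form (1.4) is a plain double sum, so the reproducing identities can be checked directly. The sketch below is illustrative only: the non-symmetric kernel, the coefficients and the function names are hypothetical choices made to stress that the first argument of $K$ carries the centers of $f$ and the second those of $g$.

```python
import numpy as np

def kernel(s, t):
    # Hypothetical bounded, non-symmetric kernel chosen only for illustration.
    return np.exp(-np.abs(s - t)) * (1.0 + 0.5 * np.tanh(s))

def bilinear_form(a, s, b, t):
    """(sum_j a_j K(s_j, .), sum_k b_k K(., t_k))_K = sum_{j,k} a_j b_k K(s_j, t_k)."""
    a, s, b, t = (np.asarray(v, dtype=float) for v in (a, s, b, t))
    gram = kernel(s[:, None], t[None, :])  # gram[j, k] = K(s_j, t_k)
    return a @ gram @ b

def evaluate_f(a, s, x):
    """f = sum_j a_j K(s_j, .) evaluated at the point x."""
    return np.sum(np.asarray(a) * kernel(np.asarray(s), x))

def evaluate_g(b, t, x):
    """g = sum_k b_k K(., t_k) evaluated at the point x."""
    return np.sum(np.asarray(b) * kernel(x, np.asarray(t)))

a, s = [1.0, -2.0, 0.5], [0.1, 0.4, 0.9]   # coefficients and centers of f
b, t = [0.3, 1.2], [0.2, 0.7]              # coefficients and centers of g
x0 = 0.55

pair_f = bilinear_form(a, s, [1.0], [x0])  # (f, K(., x0))_K
pair_g = bilinear_form([1.0], [x0], b, t)  # (K(x0, .), g)_K
print(pair_f - evaluate_f(a, s, x0), pair_g - evaluate_g(b, t, x0))
```

Both printed differences should vanish up to rounding, which is the reproducing property on $\mathcal{B}_0$ and $\mathcal{B}_0^\sharp$ stated above.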

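Proposition 2.4. The norm $\|\cdot\|_{\mathcal{B}_0^\sharp}$ is well-defined and point evaluation functionals are
continuous on $\mathcal{B}_0^\sharp$ if and only if point evaluation functionals are continuous on $\mathcal{B}_0$.

Proof. We begin with the sufficiency. Suppose that point evaluation functionals are continuous
on $\mathcal{B}_0$. That is, for any $x \in X$ there exists a positive constant $M_x$ satisfying (2.1). Let $g \in \mathcal{B}_0^\sharp$.
It must be of the form $g = \sum_{j=1}^{n} a_j K(\cdot, x_j)$ for some $a_j \in \mathbb{C}$ and $x_j \in X$, $j \in \mathbb{N}_n$, $n \in \mathbb{N}$. We
have for all nonzero $f \in \mathcal{B}_0$

$$\frac{|(f, g)_K|}{\|f\|_{\mathcal{B}_0}} = \frac{\big| \big(f, \sum_{j=1}^{n} a_j K(\cdot, x_j)\big)_K \big|}{\|f\|_{\mathcal{B}_0}} = \frac{\big| \sum_{j=1}^{n} a_j f(x_j) \big|}{\|f\|_{\mathcal{B}_0}} \le \sum_{j=1}^{n} |a_j| M_{x_j},$$

which implies that $\|g\|_{\mathcal{B}_0^\sharp}$ is well-defined. We next prove that point evaluation functionals are
continuous on $\mathcal{B}_0^\sharp$. By (2.2), we have for all $f \in \mathcal{B}_0$, $g \in \mathcal{B}_0^\sharp$

$$|(f, g)_K| \le \|f\|_{\mathcal{B}_0} \|g\|_{\mathcal{B}_0^\sharp}. \tag{2.3}$$

For any $x \in X$, taking $f = K(x, \cdot)$ in the above inequality yields that

$$|g(x)| = |(K(x, \cdot), g)_K| \le \|K(x, \cdot)\|_{\mathcal{B}_0} \|g\|_{\mathcal{B}_0^\sharp} \quad \text{for all } g \in \mathcal{B}_0^\sharp.$$

It follows that the point evaluation functional $\delta_x$ is continuous on $\mathcal{B}_0^\sharp$ as $\|K(x, \cdot)\|_{\mathcal{B}_0}$ is a constant
independent of $g$.

We next turn to the necessity. Suppose $\|g\|_{\mathcal{B}_0^\sharp}$ is well-defined for all $g \in \mathcal{B}_0^\sharp$. For any $x \in X$,
letting $g = K(\cdot, x)$ in (2.3) yields

$$|f(x)| \le \|K(\cdot, x)\|_{\mathcal{B}_0^\sharp} \|f\|_{\mathcal{B}_0},$$

which implies that point evaluation functionals are continuous on $\mathcal{B}_0$. □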

We complete $\mathcal{B}_0^\sharp$ using the norm $\|\cdot\|_{\mathcal{B}_0^\sharp}$ to a Banach space $\mathcal{B}^\sharp$ by the process described before
Proposition 2.3. We have the following observation similar to that about the space $\mathcal{B}$.

Proposition 2.5. The space $\mathcal{B}^\sharp$ is a pre-RKBS on $X$ if and only if the normed vector space $\mathcal{B}_0^\sharp$
satisfies the Norm Consistency Property.

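In the following discussion, suppose that $\mathcal{B}_0^\sharp$ endowed with the norm $\|\cdot\|_{\mathcal{B}_0^\sharp}$ has the Norm
Consistency Property. By applying the Hahn-Banach extension theorem twice, we can extend
the bilinear form $(\cdot, \cdot)_K$ from $\mathcal{B}_0 \times \mathcal{B}_0^\sharp$ to $\mathcal{B} \times \mathcal{B}^\sharp$ in a unique way such that

$$|(f, g)_K| \le \|f\|_{\mathcal{B}} \|g\|_{\mathcal{B}^\sharp}, \quad f \in \mathcal{B},\ g \in \mathcal{B}^\sharp. \tag{2.4}$$

The next result shows that the definition of $\|\cdot\|_{\mathcal{B}_0^\sharp}$ in (2.2) can be extended to $\mathcal{B}^\sharp$.

Proposition 2.6. Suppose that point evaluation functionals are continuous on $\mathcal{B}_0$. If both $\mathcal{B}_0$
and $\mathcal{B}_0^\sharp$ satisfy the Norm Consistency Property then we have

$$\|g\|_{\mathcal{B}^\sharp} = \sup_{f \in \mathcal{B},\ f \ne 0} \frac{|(f, g)_K|}{\|f\|_{\mathcal{B}}}, \quad g \in \mathcal{B}^\sharp. \tag{2.5}$$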


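Proof. By (2.4), the right hand side above is bounded by the left hand side. We only need to
prove the other direction of the inequality. We first show it for functions in $\mathcal{B}_0^\sharp$. Let $g \in \mathcal{B}_0^\sharp$. It
is straightforward to observe that

$$\|g\|_{\mathcal{B}^\sharp} = \sup_{f \in \mathcal{B}_0,\ f \ne 0} \frac{|(f, g)_K|}{\|f\|_{\mathcal{B}_0}} \le \sup_{f \in \mathcal{B},\ f \ne 0} \frac{|(f, g)_K|}{\|f\|_{\mathcal{B}}}. \tag{2.6}$$

Now let $g$ be an arbitrary but fixed function in $\mathcal{B}^\sharp$. Since $\mathcal{B}_0^\sharp$ is dense in $\mathcal{B}^\sharp$, there exists
$\{g_n : n \in \mathbb{N}\} \subseteq \mathcal{B}_0^\sharp$ such that $\|g - g_n\|_{\mathcal{B}^\sharp} \to 0$ as $n \to \infty$. This together with (2.6) implies

$$\|g\|_{\mathcal{B}^\sharp} = \lim_{n \to \infty} \|g_n\|_{\mathcal{B}^\sharp} \le \lim_{n \to \infty} \sup_{f \in \mathcal{B},\ f \ne 0} \frac{|(f, g_n)_K|}{\|f\|_{\mathcal{B}}}.$$

Note that

$$\frac{|(f, g_n)_K|}{\|f\|_{\mathcal{B}}} \le \frac{|(f, g)_K|}{\|f\|_{\mathcal{B}}} + \frac{|(f, g - g_n)_K|}{\|f\|_{\mathcal{B}}} \le \frac{|(f, g)_K|}{\|f\|_{\mathcal{B}}} + \|g - g_n\|_{\mathcal{B}^\sharp}.$$

It follows from the above two equations that

$$\|g\|_{\mathcal{B}^\sharp} \le \lim_{n \to \infty} \Big( \sup_{f \in \mathcal{B},\ f \ne 0} \frac{|(f, g)_K|}{\|f\|_{\mathcal{B}}} + \|g - g_n\|_{\mathcal{B}^\sharp} \Big) = \sup_{f \in \mathcal{B},\ f \ne 0} \frac{|(f, g)_K|}{\|f\|_{\mathcal{B}}},$$

which completes the proof. □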



We next present necessary and sufficient conditions for $K$ to be able to reproduce point
evaluation functionals on $\mathcal{B}$ and $\mathcal{B}^\sharp$ by the bilinear form. We shall see that assuming the Norm
Consistency Property, both $\mathcal{B}$ and $\mathcal{B}^\sharp$ are Banach spaces of functions on $X$ such that the point
evaluation functionals are continuous and can be represented by the bilinear form with the
function $K$. It is in this sense that $\mathcal{B}$ and $\mathcal{B}^\sharp$ are each said to be a reproducing kernel Banach space
with the reproducing kernel $K$.

Theorem 2.7. Suppose that $\mathcal{B}_0$ and $\mathcal{B}_0^\sharp$ satisfy the Norm Consistency Property. Then both $\mathcal{B}$
and $\mathcal{B}^\sharp$ are pre-RKBS on $X$ and the kernel $K$ reproduces function values via the bilinear form,
namely,

$$(f, K(\cdot, x))_K = f(x) \quad \text{for all } x \in X \text{ and } f \in \mathcal{B} \tag{2.7}$$

and

$$(K(x, \cdot), g)_K = g(x) \quad \text{for all } x \in X \text{ and } g \in \mathcal{B}^\sharp. \tag{2.8}$$

Thus, $\mathcal{B}$ and $\mathcal{B}^\sharp$ are reproducing kernel Banach spaces (RKBS).

Proof. By Propositions 2.3 and 2.5, both $\mathcal{B}$ and $\mathcal{B}^\sharp$ are pre-RKBS on $X$. For each $f \in \mathcal{B}$, there
exists a sequence $\{f_n : n \in \mathbb{N}\} \subseteq \mathcal{B}_0$ convergent to $f$. As a consequence, we have for any $x \in X$

$$f(x) = \lim_{n \to \infty} f_n(x) = \lim_{n \to \infty} (f_n, K(\cdot, x))_K.$$

By (2.4), $(\cdot, K(\cdot, x))_K$ is a bounded linear functional on $\mathcal{B}$, which implies

$$\lim_{n \to \infty} (f_n, K(\cdot, x))_K = (f, K(\cdot, x))_K.$$

Combining the above two equations proves (2.7). Equation (2.8) can be proved similarly. □

We next discuss the relationship between the space $\mathcal{B}^\sharp$ and the dual space $\mathcal{B}^*$ of $\mathcal{B}$. It is clear
by (2.4) and (2.5) that the mapping $\mathcal{L}$ from $\mathcal{B}^\sharp$ to $\mathcal{B}^*$ defined by the bilinear form,

$$(\mathcal{L} g)(f) := (f, g)_K, \quad f \in \mathcal{B},\ g \in \mathcal{B}^\sharp, \tag{2.9}$$

is isometric and linear. In other words, $\mathcal{L}$ is an embedding from $\mathcal{B}^\sharp$ to $\mathcal{B}^*$. We next present a
necessary and sufficient condition for it to be surjective.

Proposition 2.8. Suppose that both $\mathcal{B}_0$ and $\mathcal{B}_0^\sharp$ satisfy the Norm Consistency Property. The
mapping $\mathcal{L}$ defined by (2.9) is surjective onto $\mathcal{B}^*$ if and only if for any proper closed subspace
$\mathcal{M} \subsetneq \mathcal{B}$, the orthogonal space $\mathcal{M}^\perp := \{g \in \mathcal{B}^\sharp : (f, g)_K = 0 \text{ for all } f \in \mathcal{M}\}$ is nontrivial.

Proof. We first prove the necessity. For any proper closed subspace $\mathcal{M} \subsetneq \mathcal{B}$, by the
Hahn-Banach theorem, there exists a nontrivial functional $\nu \in \mathcal{B}^*$ such that $\nu(f) = 0$ for all $f \in \mathcal{M}$.
If $\mathcal{L}$ is surjective then there exists a function $g \in \mathcal{B}^\sharp$ such that $\mathcal{L}(g) = \nu$, namely, $\nu(f) = (f, g)_K$
for all $f \in \mathcal{B}$. It follows that $g \in \mathcal{M}^\perp$ and $g \ne 0$ as $\nu$ is nontrivial.

We next show the sufficiency. Let $\nu$ be a nontrivial functional in $\mathcal{B}^*$. Then its kernel $\mathcal{M} := \ker(\nu)$
is a proper closed subspace of $\mathcal{B}$. By assumption, there exists a nonzero function $g \in \mathcal{M}^\perp$.
This enables us to find a function $f_0 \in \mathcal{B} \setminus \mathcal{M}$ such that $(f_0, g)_K \ne 0$ and $\nu(f_0) = 1$. Set
$g_0 := g / (f_0, g)_K$. Since $f - \nu(f) f_0 \in \ker(\nu)$ for all $f \in \mathcal{B}$, we get for any $f \in \mathcal{B}$

$$(f, g_0)_K = (f - \nu(f) f_0, g_0)_K + (\nu(f) f_0, g_0)_K = \nu(f) (f_0, g_0)_K = \nu(f),$$

which implies that $\mathcal{L}$ is surjective. □

We close the section with a conclusion on the general construction and the related results
presented above.

Theorem 2.9. Suppose that

(a) the vector space $\mathcal{B}_0 = \operatorname{span}\{K(x, \cdot) : x \in X\}$ with the norm $\|\cdot\|_{\mathcal{B}_0}$ has the Norm
Consistency Property, and

(b) point evaluation functionals are continuous on $\mathcal{B}_0$.

Then the following statements hold true:

(1) $\mathcal{B}_0$ can be completed to a pre-RKBS $\mathcal{B}$ on $X$;

(2) the norm $\|\cdot\|_{\mathcal{B}_0^\sharp}$ given by (2.2) is well-defined and point evaluation functionals are bounded
on $\mathcal{B}_0^\sharp$ with respect to this norm;

(3) if $\mathcal{B}_0^\sharp$ satisfies the Norm Consistency Property as well then $\mathcal{B}_0^\sharp$ can be completed to an RKBS
$\mathcal{B}^\sharp$ and $K$ is the reproducing kernel for both $\mathcal{B}$ and $\mathcal{B}^\sharp$ in the sense that (2.7) and (2.8)
hold true. In this case, $\mathcal{B}^\sharp$ can be isometrically embedded into $\mathcal{B}^*$ via the bilinear form,
and the embedding is surjective if and only if for any proper closed subspace $\mathcal{M}$ of $\mathcal{B}$, $\mathcal{M}^\perp$
is nontrivial.

3 RKBS with the $\ell^1$ Norm

We shall follow the procedures in Theorem 2.9 to construct an RKBS with the $\ell^1$ norm in this
section. To start, we let $K$ be a bounded function on $X \times X$ such that

$$K(x_j, \cdot),\ j \in \mathbb{N}_n, \text{ are linearly independent for all pairwise distinct points } x_j \in X,\ j \in \mathbb{N}_n. \tag{3.1}$$

Note that this assumption is implied by Admissibility Assumption (A1), but is somewhat weaker
than (A1). Introduce an $\ell^1$ norm on $\mathcal{B}_0 = \operatorname{span}\{K(x, \cdot) : x \in X\}$ by setting for all finitely many
pairwise distinct points $x_j \in X$ and constants $c_j \in \mathbb{C}$, $j \in \mathbb{N}_m$, $m \in \mathbb{N}$

$$\Big\| \sum_{j=1}^{m} c_j K(x_j, \cdot) \Big\|_{\mathcal{B}_0} := \sum_{j=1}^{m} |c_j|. \tag{3.2}$$

Since $K$ is bounded, it is clear that point evaluation functionals are bounded on $\mathcal{B}_0$. We next
check the important Norm Consistency Property and find that it is implied by Admissibility
Assumption (A3).

Proposition 3.1. The space $\mathcal{B}_0$ with the norm (3.2) satisfies the Norm Consistency Property
if and only if $K$ satisfies (A3).

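Proof. We first show the necessity. Suppose that for some $c \in \ell^1(\mathbb{N})$ and pairwise distinct
$\{x_j \in X : j \in \mathbb{N}\}$, $\sum_{j=1}^{\infty} c_j K(x_j, x) = 0$ for all $x \in X$. Let $f_n := \sum_{j=1}^{n} c_j K(x_j, \cdot)$ for all $n \in \mathbb{N}$.
Since $c \in \ell^1(\mathbb{N})$, $\{f_n : n \in \mathbb{N}\}$ forms a Cauchy sequence in $\mathcal{B}_0$. Moreover, $\lim_{n \to \infty} f_n(x) = 0$ for
all $x \in X$ as $K$ is bounded on $X \times X$. It follows from the Norm Consistency Property that
$\lim_{n \to \infty} \|f_n\|_{\mathcal{B}_0} = \lim_{n \to \infty} \sum_{j=1}^{n} |c_j| = \|c\|_{\ell^1(\mathbb{N})} = 0$. Therefore, (A3) holds true.

On the other hand, suppose that $K$ satisfies (A3). Let $\{f_n : n \in \mathbb{N}\}$ be a Cauchy sequence
in $\mathcal{B}_0$ with $\lim_{n \to \infty} f_n(x) = 0$ for all $x \in X$. We can find pairwise distinct $x_j \in X$, $j \in \mathbb{N}$, such that
for any $n \in \mathbb{N}$

$$f_n = \sum_{j=1}^{\infty} c_{n,j} K(x_j, \cdot),$$

where $c_n := (c_{n,j} : j \in \mathbb{N})$ has finitely many nonzero components. By definition (3.2), $\{c_n : n \in \mathbb{N}\}$
is a Cauchy sequence in $\ell^1(\mathbb{N})$. Let $c$ be its limit in $\ell^1(\mathbb{N})$ and define

$$f := \sum_{j=1}^{\infty} c_j K(x_j, \cdot).$$

Suppose that $|K(s,t)| \le M$ for some positive constant $M$ and all $s, t \in X$. A direct calculation
gives that for any $x \in X$

$$|f_n(x) - f(x)| = \Big| \sum_{j=1}^{\infty} (c_{n,j} - c_j) K(x_j, x) \Big| \le M \|c_n - c\|_{\ell^1(\mathbb{N})}.$$

It follows that $\lim_{n \to \infty} f_n(x) = f(x)$ for all $x \in X$. Since $\lim_{n \to \infty} f_n(x) = 0$ for all $x \in X$, we have
$f(x) = 0$ for all $x \in X$. By (A3), $c = 0$, which implies

$$\lim_{n \to \infty} \|f_n\|_{\mathcal{B}_0} = \lim_{n \to \infty} \|c_n\|_{\ell^1(\mathbb{N})} = \|c\|_{\ell^1(\mathbb{N})} = 0.$$

The proof is complete. □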
Functions $K$ satisfying property (A3) will be given later. We assume for the time being that
(A3) holds true. One sees from the proof of Proposition 3.1 that $\mathcal{B}$ has the form (1.3). We
remark that in the preparation of the paper, we came across a Banach space with a form similar
to (1.3) used in ||3^ for error estimates with linear programming regularization. One observes
from (1.3) that $\ell^1(X)$ is isometrically isomorphic to $\mathcal{B}$ through the mapping

$$\Phi(c) := \sum_{t \in \operatorname{supp} c} c_t K(t, \cdot), \quad c \in \ell^1(X).$$

In this sense, we say that $\mathcal{B}$ is a pre-RKBS on $X$ with the $\ell^1$ norm. It remains to derive a
reproducing kernel for it. By Theorem 2.7, it suffices to check the Norm Consistency Property
for $\mathcal{B}_0^\sharp$. We shall show that the Norm Consistency Property automatically holds true for $\mathcal{B}_0^\sharp$
without any additional requirement. To this end, we first calculate a specific form of the norm
$\|\cdot\|_{\mathcal{B}_0^\sharp}$.

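Denote for any function $g$ on $X$ by $\|g\|_{L^\infty(X)}$ the supremum of $|g(x)|$ over $x \in X$.

Lemma 3.2. There holds for any function $g \in \mathcal{B}_0^\sharp$ that $\|g\|_{\mathcal{B}_0^\sharp} = \|g\|_{L^\infty(X)}$.

Proof. We first prove that $\|g\|_{\mathcal{B}_0^\sharp}$ is bounded by $\|g\|_{L^\infty(X)}$. Any $f \in \mathcal{B}_0$ has the form
$f = \sum_{j=1}^{m} c_j K(x_j, \cdot)$ for some $c_j \in \mathbb{C}$ and pairwise distinct $x_j \in X$, $j \in \mathbb{N}_m$. We verify that

$$|(f, g)_K| = \Big| \Big( \sum_{j=1}^{m} c_j K(x_j, \cdot), g \Big)_K \Big| = \Big| \sum_{j=1}^{m} c_j g(x_j) \Big| \le \|g\|_{L^\infty(X)} \sum_{j=1}^{m} |c_j| = \|g\|_{L^\infty(X)} \|f\|_{\mathcal{B}_0},$$

which implies $\|g\|_{\mathcal{B}_0^\sharp} \le \|g\|_{L^\infty(X)}$. For the other direction, we notice for all $x_0 \in X$

$$\|g\|_{\mathcal{B}_0^\sharp} \ge \frac{|(K(x_0, \cdot), g)_K|}{\|K(x_0, \cdot)\|_{\mathcal{B}_0}} = |g(x_0)|.$$

Since $x_0$ is arbitrarily chosen, we have $\|g\|_{\mathcal{B}_0^\sharp} \ge \|g\|_{L^\infty(X)}$. □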
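The two inequalities in the proof of Lemma 3.2 can be observed numerically on a grid. The following sketch is illustrative only: the exponential kernel $e^{-|s-t|}$, the grid standing in for $X$, and the coefficients are assumptions made for the example, and the grid maximum is merely a proxy for the supremum over $X$.

```python
import numpy as np

def kernel(s, t):
    # Assumed bounded kernel for illustration: exponential kernel on the real line.
    return np.exp(-np.abs(s - t))

centers_g = np.array([0.2, 0.5, 0.8])   # g = sum_k b_k K(., t_k)
b = np.array([1.0, -0.7, 0.4])

def g(x):
    """Evaluate g at the points x (scalar or array)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return kernel(x[:, None], centers_g[None, :]) @ b

grid = np.linspace(-2.0, 3.0, 2001)     # finite stand-in for the input space X
g_on_grid = g(grid)
sup_g = np.max(np.abs(g_on_grid))       # grid proxy for ||g||_{L^inf(X)}

rng = np.random.default_rng(0)
ratios = []
for _ in range(200):
    m = int(rng.integers(1, 6))
    pts = rng.choice(grid, size=m, replace=False)   # pairwise distinct centers of f
    c = rng.standard_normal(m)
    pairing = c @ g(pts)                            # (f, g)_K = sum_j c_j g(x_j)
    ratios.append(abs(pairing) / np.abs(c).sum())   # |(f, g)_K| / ||f||_{B_0}
max_ratio = max(ratios)

# A single atom f = K(x0, .) has norm 1 and pairing g(x0), so it attains the bound.
x_star = grid[np.argmax(np.abs(g_on_grid))]
attained = abs(g(x_star)[0])
print(max_ratio, sup_g, attained)
```

Random expansions never exceed the sup norm, while the single atom centered at a maximizer of $|g|$ reaches it, mirroring the two directions of the proof.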

We show that the space $\mathcal{B}^\sharp$ is also a pre-RKBS on $X$.

Lemma 3.3. The space $\mathcal{B}_0^\sharp$ satisfies the Norm Consistency Property.

Proof. Let $\{f_n : n \in \mathbb{N}\}$ be a Cauchy sequence in $\mathcal{B}_0^\sharp$ with $\lim_{n \to \infty} f_n(x) = 0$ for all $x \in X$. By
Lemma 3.2, there exists for any $\varepsilon > 0$ some positive integer $N_0$ such that when $m, n > N_0$,

$$|f_m(x) - f_n(x)| \le \varepsilon \quad \text{for all } x \in X.$$

Since $\lim_{n \to \infty} f_n(x) = 0$, we let $n$ go to infinity in the above inequality to obtain that when
$m > N_0$,

$$|f_m(x)| \le \varepsilon \quad \text{for all } x \in X.$$

In other words, $\|f_m\|_{L^\infty(X)} \le \varepsilon$ when $m > N_0$, implying $\lim_{n \to \infty} \|f_n\|_{L^\infty(X)} = 0$. □



By Proposition 3.1 and Lemmas 3.2 and 3.3, we conclude our construction of RKBS with
the $\ell^1$ norm in the following result.

Theorem 3.4. Let $K$ be a bounded function on $X \times X$ that satisfies (A3). Then $\mathcal{B}$ having the
form (1.3) and $\mathcal{B}^\sharp$ are RKBS on $X$ with the reproducing kernel $K$.

We shall discuss in the rest of this section conditions on translation invariant $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{C}$
for which Admissibility Assumption (A3) holds. Specifically, such $K$ are of the form

$$K(s, t) = \int_{\mathbb{R}^d} e^{-i (s - t) \cdot \xi} \varphi(\xi) \, d\xi, \quad s, t \in \mathbb{R}^d, \tag{3.3}$$

where $s \cdot t$ stands for the standard inner product on $\mathbb{R}^d$, and $\varphi \in L^1(\mathbb{R}^d)$, the space of Lebesgue
integrable functions on $\mathbb{R}^d$. One should not confuse $L^1(\mathbb{R}^d)$ with $\ell^1(\mathbb{R}^d)$. The latter is
defined with respect to the counting measure on $\mathbb{R}^d$ while the former is with respect to the
Lebesgue measure. Note that $K$ is bounded and continuous on $\mathbb{R}^d \times \mathbb{R}^d$. We give a sufficient
condition for a function $K$ so defined to satisfy (A3).

Proposition 3.5. Let $K$ be given by (3.3). If $\varphi$ is nonzero almost everywhere on $\mathbb{R}^d$ with respect
to the Lebesgue measure then $K$ satisfies (A3).

Proof. Suppose that there exist $c \in \ell^1(\mathbb{N})$ and pairwise distinct points $s_j \in \mathbb{R}^d$, $j \in \mathbb{N}$, such that

$$\sum_{j=1}^{\infty} c_j K(s_j, t) = 0 \quad \text{for all } t \in \mathbb{R}^d.$$

This equation can be reformulated by (3.3) as

$$\int_{\mathbb{R}^d} \Big( \sum_{j=1}^{\infty} c_j e^{-i s_j \cdot \xi} \Big) \varphi(\xi) e^{i t \cdot \xi} \, d\xi = 0 \quad \text{for all } t \in \mathbb{R}^d.$$

It follows that for almost every $\xi \in \mathbb{R}^d$ with respect to the Lebesgue measure

$$\Big( \sum_{j=1}^{\infty} c_j e^{-i s_j \cdot \xi} \Big) \varphi(\xi) = 0.$$

By the assumption on $\varphi$,

$$\sum_{j=1}^{\infty} c_j e^{-i s_j \cdot \xi} = 0 \quad \text{for almost every } \xi \in \mathbb{R}^d.$$

Note that the function on the left hand side above is continuous in $\xi$. We hence obtain that the
Fourier transform of the discrete measure

$$\nu(A) := \sum_{s_j \in A} c_j \quad \text{for every Borel subset } A \subseteq \mathbb{R}^d$$

is zero. Consequently, $\nu$ is the zero measure, implying $c = 0$. □



We next present a particular example as a corollary to Proposition 3.5.

Corollary 3.6. If $\phi$ is a nontrivial continuous function on $\mathbb{R}^d$ with compact support then
$K(s, t) = \phi(s - t)$, $s, t \in \mathbb{R}^d$, satisfies (A3).

Proof. We regard $\phi$ as a tempered distribution and note by the Paley-Wiener theorem that the
Fourier transform of $\phi$ is real-analytic on $\mathbb{R}^d$. Therefore, the Fourier transform of $\phi$ is nonzero
everywhere on $\mathbb{R}^d$ except on a subset of zero Lebesgue measure. Arguments similar to those
in the proof of the last proposition hence apply. □



We next present by Proposition 3.5 and Corollary 3.6 several examples of $K$ that satisfy
(A3) and hence can be used to construct RKBS with the $\ell^1$ norm. Such functions include:



the exponential kernel 

d 

where for $s \in \mathbb{R}^d$, $\|s\|_2$ is its standard Euclidean norm on $\mathbb{R}^d$.
the Gaussian kernel



(3.4) 



inverse multiquadrics

$$K(s, t) = \Big( \frac{1}{1 + \|s - t\|_2^2} \Big)^{\beta}, \quad \beta > 0, \tag{3.5}$$

whose Fourier transform is given by the modified Bessel function and is positive almost
everywhere on $\mathbb{R}^d$ (see [28], pages 52, 76 and 95).

B-spline kernels

$$K(s, t) = \prod_{j=1}^{d} B_p(s_j - t_j), \quad s, t \in \mathbb{R}^d,$$

where $s_j$ is the $j$-th component of $s$ and $B_p$ denotes the $p$-th order B-spline, $p \ge 2$.
B-spline kernels satisfy (A3) as they are given by bounded continuous functions of compact
support.

radial basis functions of compact support, including Wu's functions [^] and Wendland's
functions [25]. Such functions are of the form $K(s, t) = \phi(\|s - t\|_2)$, $s, t \in \mathbb{R}^d$, where $\phi$
is a compactly supported univariate function dependent on the dimension $d$. We give two
examples for $d = 3$:

$$\phi(r) := (1 - r)_+^2 \quad \text{and} \quad \phi(r) := (1 - r)_+^4 (1 + 4r), \quad r \ge 0,$$

where $t_+ := \max\{0, t\}$ for $t \in \mathbb{R}$. These functions satisfy (A3) by Corollary 3.6.
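The two compactly supported radial functions above are easy to evaluate, and their kernel matrices are sparse whenever points lie farther apart than the support radius. The sketch below is illustrative only: the point set (corners of a cube of side 0.6 in $\mathbb{R}^3$) and the function names are hypothetical, and a nonsingular matrix on one point set merely illustrates the linear independence required in (3.1); it does not verify (A3).

```python
import numpy as np

def phi_quadratic(r):
    """phi(r) = (1 - r)_+^2."""
    return np.maximum(0.0, 1.0 - r) ** 2

def phi_quartic(r):
    """phi(r) = (1 - r)_+^4 (1 + 4 r)."""
    return np.maximum(0.0, 1.0 - r) ** 4 * (1.0 + 4.0 * r)

def radial_kernel_matrix(phi, points):
    """Matrix [phi(||x_j - x_k||_2)] for points given as rows of an array."""
    diff = points[:, None, :] - points[None, :, :]
    return phi(np.linalg.norm(diff, axis=-1))

# Eight pairwise distinct points: corners of a cube with side 0.6 in R^3.
corners = 0.6 * np.array(
    [[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)], dtype=float
)

G = radial_kernel_matrix(phi_quartic, corners)
min_eig = np.linalg.eigvalsh(G).min()        # positive means G is nonsingular
n_zero_entries = int(np.sum(G == 0.0))       # pairs farther apart than the support radius 1
print(min_eig, n_zero_entries)
```

Opposite corners of the cube are at distance $0.6\sqrt{3} > 1$, so the corresponding entries vanish exactly; this sparsity is one practical reason for preferring compactly supported kernels.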

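On the other hand, a translation invariant $K$ does not satisfy (A3) if its Fourier transform
is compactly supported, as indicated in the next result.

Proposition 3.7. If $\varphi \in L^1(\mathbb{R}^d)$ is compactly supported on $\mathbb{R}^d$ then $K$ given by (3.3) does not
satisfy (A3).

Proof. Without loss of generality, we may assume that $\operatorname{supp} \varphi \subseteq [-1, 1]^d$. Choose a nontrivial
infinitely continuously differentiable function $\phi$ that is supported on $[-\pi, \pi]^d$ and vanishes on
$[-1, 1]^d$. We expand $\phi$ to a Fourier series

$$\phi(\xi) = \sum_{j \in \mathbb{Z}^d} c_j e^{-i j \cdot \xi}, \quad \xi \in [-\pi, \pi]^d,$$

where $c_j$ is the Fourier coefficient of $\phi$. Note that $\{c_j : j \in \mathbb{Z}^d\} \in \ell^1(\mathbb{Z}^d)$ as $\phi$ is infinitely
continuously differentiable on $[-\pi, \pi]^d$. By arguments in the proof of Proposition 3.5,

$$\sum_{j \in \mathbb{Z}^d} c_j K(j, t) = \int_{\mathbb{R}^d} \Big( \sum_{j \in \mathbb{Z}^d} c_j e^{-i j \cdot \xi} \Big) \varphi(\xi) e^{i t \cdot \xi} \, d\xi, \quad t \in \mathbb{R}^d.$$

By our construction, the Fourier series vanishes on $[-1, 1]^d$, which contains the support of $\varphi$, so that

$$\Big( \sum_{j \in \mathbb{Z}^d} c_j e^{-i j \cdot \xi} \Big) \varphi(\xi) = 0 \quad \text{for all } \xi \in \mathbb{R}^d,$$

which implies $\sum_{j \in \mathbb{Z}^d} c_j K(j, \cdot) = 0$. Moreover, $c_j \ne 0$ for at least one $j \in \mathbb{Z}^d$ because $\phi$ is
nontrivial. We obtain that $K$ does not satisfy (A3). □

By Proposition 3.7, the sinc kernel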

$$K(s, t) := \operatorname{sinc}(s - t) := \prod_{j=1}^{d} \frac{\sin \pi (s_j - t_j)}{\pi (s_j - t_j)}, \quad s, t \in \mathbb{R}^d$$

does not satisfy (A3). As a consequence, it cannot yield an RKBS with the $\ell^1$ norm by the
procedures introduced in this section. Arguments similar to those in the proof of Proposition 3.7
show that if $\nu$ is a compactly supported Borel measure on $\mathbb{R}^d$ of finite total variation
then the function

$$K(s, t) := \int_{\mathbb{R}^d} e^{-i (s - t) \cdot \xi} \, d\nu(\xi), \quad s, t \in \mathbb{R}^d$$

does not satisfy (A3). Instances include the class of Bessel-based radial functions [|^] where the
Borel measure is the Dirac delta measure on the unit sphere of the Euclidean space.

4 Representer Theorems in RKBS with the $\ell^1$ Norm

Up to now our arguments have relied on Admissibility Assumptions (A1)-(A3). In this section
the final assumption, (A4), is invoked to guarantee that the representer theorem holds for
the constructed RKBS. A regularized learning scheme in the RKBS $\mathcal{B}$ constructed by (1.3) can
be generally expressed as finding $f_0$ such that

$$f_0 = \arg\min_{f \in \mathcal{B}} \big[ V(f(\mathbf{x})) + \mu \phi(\|f\|_{\mathcal{B}}) \big], \tag{4.1}$$

where $\mathbf{x} := \{x_j \in X : j \in \mathbb{N}_n\}$, $n \in \mathbb{N}$, is the sequence of given pairwise distinct sampling points,
$f(\mathbf{x}) := (f(x_j) : j \in \mathbb{N}_n) \in \mathbb{C}^n$, $V : \mathbb{C}^n \to \mathbb{R}_+$ is a loss function, $\mu$ is a positive regularization
parameter, and $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ is a nondecreasing regularization function. Here, $\mathbb{R}_+ := [0, +\infty)$.
The loss function and regularization function should satisfy some minimal requirements for the
learning scheme (4.1) to be useful. This consideration gives rise to the following definition.



Definition 4.1. A regularized learning scheme (4.1) is said to be acceptable if $V$ and $\phi$ are
continuous and

$$\lim_{t \to \infty} \phi(t) = +\infty. \tag{4.2}$$

It is possible that the solution to (4.1) is non-unique, and in that case we are only interested
in finding one possible solution.

We now introduce the main concept of this section.

Definition 4.2. The space $\mathcal{B}$ is said to satisfy the linear representer theorem for regularized
learning if every acceptable regularized learning scheme (4.1) has a minimizer of the form

$$f_0 = \sum_{j=1}^{n} c_j K(x_j, \cdot), \tag{4.3}$$

where the $c_j$'s are constants. In other words, there exists a solution $f_0$ lying in the finite dimensional
subspace $\mathcal{S}^{\mathbf{x}} := \operatorname{span}\{K(x_j, \cdot) : j \in \mathbb{N}_n\}$.

An RKHS with $K$ being its reproducing kernel in the usual sense always satisfies the linear
representer theorem [H. The result for uniformly convex and uniformly Fréchet differentiable
pre-RKBS with a reproducing kernel given by the semi-inner product was established in [31, 32].
For more information on this important property for RKHS and vector-valued RKHS, see, for
example, [1], ^ and the references cited therein.

Our purpose is to discuss the conditions on $K$ such that $\mathcal{B}$ satisfies the linear representer theorem. The representer theorem for (4.1) is closely related to the representer theorem for the minimal norm interpolation problem. In the RKHS case, an equivalence was proved in [16]. We shall follow the approach to consider the minimal norm interpolation in $\mathcal{B}$ first. For any $\mathbf{y} \in \mathbb{C}^n$, set $\mathcal{I}_{\mathbf{x}}(\mathbf{y})$ to be the subset of functions in $\mathcal{B}$ that interpolate the specified data, namely, $\mathcal{I}_{\mathbf{x}}(\mathbf{y}) := \{f \in \mathcal{B} : f(\mathbf{x}) = \mathbf{y}\}$. A minimal norm interpolant in $\mathcal{B}$ is a function $f_{\min}$ satisfying

$$f_{\min} = \arg\min\{\|f\|_{\mathcal{B}} : f \in \mathcal{I}_{\mathbf{x}}(\mathbf{y})\}. \tag{4.4}$$

Again, in the case of a non-unique solution, we are only interested in obtaining one solution. Since $K[\mathbf{x}]$ is nonsingular, one sees that the typically infinite dimensional $\mathcal{I}_{\mathbf{x}}(\mathbf{y})$ always has a non-empty intersection with $\mathcal{S}^{\mathbf{x}}$, for all $\mathbf{y} \in \mathbb{C}^n$ and pairwise distinct $\mathbf{x} \subseteq X$.

Definition 4.3. An RKBS $\mathcal{B}$ is said to satisfy the linear representer theorem for minimal norm interpolation if for any choice of data, $\mathbf{x}$ and $\mathbf{y}$, there is a minimal norm interpolant, $f_{\min}$, lying in $\mathcal{S}^{\mathbf{x}}$.

We shall show that $\mathcal{B}$ satisfies the linear representer theorem for regularized learning if and only if it does so for minimal norm interpolation. We first prove one direction of the equivalence.

Lemma 4.4. If $\mathcal{B}$ satisfies the linear representer theorem for the minimal norm interpolation, then it also does so for regularized learning.

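Proof. Let $V$, $\phi$, and $\mu$ be arbitrary, but fixed according to the conditions that (4.1) be an acceptable regularization scheme. For an arbitrary function $f$ in $\mathcal{B}$, we let $f_0$ be the minimizer of $\inf_{g\in\mathcal{I}_{\mathbf{x}}(f(\mathbf{x}))} \|g\|_{\mathcal{B}}$ that has the form (4.3). Then $f_0(\mathbf{x}) = f(\mathbf{x})$ and $\|f_0\|_{\mathcal{B}} \le \|f\|_{\mathcal{B}}$. As a consequence, $V(f_0(\mathbf{x})) = V(f(\mathbf{x}))$ but $\phi(\|f_0\|_{\mathcal{B}}) \le \phi(\|f\|_{\mathcal{B}})$ as $\phi$ is nondecreasing. It follows that

$$\inf_{f\in\mathcal{S}^{\mathbf{x}}} V(f(\mathbf{x})) + \mu\phi(\|f\|_{\mathcal{B}}) = \inf_{f\in\mathcal{B}} V(f(\mathbf{x})) + \mu\phi(\|f\|_{\mathcal{B}}).$$

By (4.2), there exists a positive constant $\alpha$ such that

$$\inf_{f\in\mathcal{S}^{\mathbf{x}}} V(f(\mathbf{x})) + \mu\phi(\|f\|_{\mathcal{B}}) = \inf_{f\in\mathcal{S}^{\mathbf{x}},\ \|f\|_{\mathcal{B}}\le\alpha} V(f(\mathbf{x})) + \mu\phi(\|f\|_{\mathcal{B}}).$$

Note that the functional we are minimizing is continuous on $\mathcal{B}$ by the assumption on $V$, $\phi$ and by the continuity of point evaluation functionals on $\mathcal{B}$. By the elementary fact that a continuous function on a compact metric space attains its minimum in the space, (4.1) has a minimizer that belongs to $\{f \in \mathcal{S}^{\mathbf{x}} : \|f\|_{\mathcal{B}} \le \alpha\}$. Therefore, $\mathcal{B}$ satisfies the linear representer theorem. □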

For the other direction, it suffices to consider a class of regularization functionals with a particular choice of $V$ and $\phi$. In the limit of vanishing $\mu$ we recover the minimal norm interpolant.

Lemma 4.5. If $\mathcal{B}$ satisfies the linear representer theorem for regularized learning, then it also satisfies the linear representer theorem for minimal norm interpolation.
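
Proof. We shall follow the idea in [16]. Choose any $n \in \mathbb{N}$, any $\mathbf{x} = \{x_j \in X : j \in \mathbb{N}_n\}$ with pairwise distinct elements, and any $\mathbf{y} \in \mathbb{C}^n$. For every $\mu > 0$, let $f_{0,\mu} \in \mathcal{S}^{\mathbf{x}}$ be a minimizer of (4.1) with the choice of

$$V(f(\mathbf{x})) = \|f(\mathbf{x}) - \mathbf{y}\|_2^2, \qquad \phi(t) = t. \tag{4.5}$$

Here, $\|\cdot\|_2$ is the standard Euclidean norm on $\mathbb{C}^n$. Defining the $1 \times n$ row vector function by

$$K^{\mathbf{x}}(x) := (K(x_j, x) : j \in \mathbb{N}_n) \quad \text{for all } x \in X,$$

it follows that $f_{0,\mu} = K^{\mathbf{x}}(\cdot)\mathbf{c}_\mu$ for some $\mathbf{c}_\mu \in \mathbb{C}^n$. Then we have

$$\|K[\mathbf{x}]\mathbf{c}_\mu - \mathbf{y}\|_2^2 = \|f_{0,\mu}(\mathbf{x}) - \mathbf{y}\|_2^2 \le V(f_{0,\mu}(\mathbf{x})) + \mu\phi(\|f_{0,\mu}\|_{\mathcal{B}}) \le V(0) + \mu\phi(\|0\|_{\mathcal{B}}) = \|\mathbf{y}\|_2^2.$$

As $K[\mathbf{x}]$ is nonsingular, the above inequality implies that $\{\mathbf{c}_\mu : \mu > 0\}$ forms a bounded set in $\mathbb{C}^n$. By restricting to a subsequence if necessary, we may hence assume that $\mathbf{c}_\mu$ converges to some $\mathbf{c}_0 \in \mathbb{C}^n$ as $\mu$ goes to zero. We shall show that $f_{0,0} := K^{\mathbf{x}}(\cdot)\mathbf{c}_0 \in \mathcal{S}^{\mathbf{x}}$ is a minimal norm interpolant.

Since $\mathbf{c}_\mu$ converges to $\mathbf{c}_0$ as $\mu$ tends to zero, we first get

$$\lim_{\mu\to 0} \|f_{0,\mu} - f_{0,0}\|_{\mathcal{B}} = \lim_{\mu\to 0} \|\mathbf{c}_\mu - \mathbf{c}_0\|_{\ell^1(\mathbb{N}_n)} = 0. \tag{4.6}$$

Since point evaluation functionals are continuous on $\mathcal{B}$, we obtain by (4.6)

$$f_{0,0}(x_j) = \lim_{\mu\to 0} f_{0,\mu}(x_j) \quad \text{for all } j \in \mathbb{N}_n. \tag{4.7}$$

Now let $g$ be an arbitrary interpolant, i.e., an arbitrary element of $\mathcal{I}_{\mathbf{x}}(\mathbf{y})$. As $f_{0,\mu}$ is a minimizer of (4.1) with the choice (4.5), it follows that

$$\|f_{0,\mu}(\mathbf{x}) - \mathbf{y}\|_2^2 + \mu\|f_{0,\mu}\|_{\mathcal{B}} \le \|g(\mathbf{x}) - \mathbf{y}\|_2^2 + \mu\|g\|_{\mathcal{B}} = \mu\|g\|_{\mathcal{B}}. \tag{4.8}$$

Letting $\mu \to 0$ on both sides of the above inequality, we obtain by (4.7) $\|f_{0,0}(\mathbf{x}) - \mathbf{y}\|_2 = 0$, which implies that $f_{0,0}$ is also an interpolant, i.e., $f_{0,0} \in \mathcal{I}_{\mathbf{x}}(\mathbf{y})$. It also follows from (4.8) that $\|f_{0,\mu}\|_{\mathcal{B}} \le \|g\|_{\mathcal{B}}$ for all $\mu > 0$, which together with (4.6) implies $\|f_{0,0}\|_{\mathcal{B}} \le \|g\|_{\mathcal{B}}$. Since $g$ is an arbitrary function in $\mathcal{I}_{\mathbf{x}}(\mathbf{y})$ and $f_{0,0} \in \mathcal{I}_{\mathbf{x}}(\mathbf{y})$, we see that $f_{0,0}$ is a minimal norm interpolant, i.e., a solution of (4.4). The proof is complete. □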



Combining Lemmas 4.4 and 4.5, we reach the characterization for $\mathcal{B}$ to satisfy the linear representer theorem.

Proposition 4.6. The space $\mathcal{B}$ satisfies the linear representer theorem for regularized learning if and only if $\mathcal{B}$ satisfies the linear representer theorem for minimal norm interpolation.

In view of the above result, we shall focus on necessary and sufficient conditions for the minimal norm interpolation in $\mathcal{B}$ to satisfy the linear representer theorem. To this end, we begin with the simplest case when only one more sampling point is added to $\mathbf{x}$. Recall the definition of $K_{\mathbf{x}}(x)$ from the introduction. It is worthwhile to point out that $K_{\mathbf{x}}(x)$ is in general not the transpose of $K^{\mathbf{x}}(x)$ as $K$ is not required to be symmetric.

Lemma 4.7. Let $\mathbf{x} = \{x_j \in X : j \in \mathbb{N}_n\}$ have pairwise distinct elements, let $x_{n+1}$ be an arbitrary point in $X \setminus \mathbf{x}$, and set $\bar{\mathbf{x}} := \{x_j : j \in \mathbb{N}_{n+1}\}$. It follows that the minimum norm interpolant in $\mathcal{S}^{\bar{\mathbf{x}}}$ is the same as the minimum norm interpolant in $\mathcal{S}^{\mathbf{x}}$, i.e.,

$$\min_{f\in\mathcal{I}_{\mathbf{x}}(\mathbf{y})\cap\mathcal{S}^{\bar{\mathbf{x}}}} \|f\|_{\mathcal{B}} = \min_{f\in\mathcal{I}_{\mathbf{x}}(\mathbf{y})\cap\mathcal{S}^{\mathbf{x}}} \|f\|_{\mathcal{B}} \quad \text{for all } \mathbf{y} \in \mathbb{C}^n, \tag{4.9}$$

if and only if (1.2) holds true.
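
Proof. Notice that $\mathcal{I}_{\mathbf{x}}(\mathbf{y}) \cap \mathcal{S}^{\mathbf{x}}$ has only one function $f = K^{\mathbf{x}}(\cdot)K[\mathbf{x}]^{-1}\mathbf{y}$. We next estimate the norm of functions in $\mathcal{I}_{\mathbf{x}}(\mathbf{y}) \cap \mathcal{S}^{\bar{\mathbf{x}}}$. Let $g \in \mathcal{I}_{\mathbf{x}}(\mathbf{y}) \cap \mathcal{S}^{\bar{\mathbf{x}}}$ and $b := g(x_{n+1})$. Note that $g$ is uniquely determined by $b$ as it has already satisfied the interpolation condition $g(\mathbf{x}) = \mathbf{y}$. In fact, as $K[\bar{\mathbf{x}}]$ is nonsingular, $g = K^{\bar{\mathbf{x}}}(\cdot)K[\bar{\mathbf{x}}]^{-1}\bar{\mathbf{y}}$, where $\bar{\mathbf{y}} = (\mathbf{y}^T, b)^T \in \mathbb{C}^{n+1}$. Direct computations show that

$$K[\bar{\mathbf{x}}]^{-1}\bar{\mathbf{y}} = \begin{pmatrix} K[\mathbf{x}] & K_{\mathbf{x}}(x_{n+1}) \\ K^{\mathbf{x}}(x_{n+1}) & K(x_{n+1}, x_{n+1}) \end{pmatrix}^{-1} \begin{pmatrix} \mathbf{y} \\ b \end{pmatrix} = \begin{pmatrix} K[\mathbf{x}]^{-1}\mathbf{y} + \frac{q}{p} K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x_{n+1}) \\ -\frac{q}{p} \end{pmatrix},$$

where $p := K(x_{n+1}, x_{n+1}) - K^{\mathbf{x}}(x_{n+1})K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x_{n+1})$ and $q := K^{\mathbf{x}}(x_{n+1})K[\mathbf{x}]^{-1}\mathbf{y} - b$.

We now show sufficiency. If (1.2) holds true then we have

$$\|g\|_{\mathcal{B}} = \|K[\bar{\mathbf{x}}]^{-1}\bar{\mathbf{y}}\|_{\ell^1(\mathbb{N}_{n+1})} \ge \|K[\mathbf{x}]^{-1}\mathbf{y}\|_{\ell^1(\mathbb{N}_n)} - \left|\frac{q}{p}\right| \|(K[\mathbf{x}])^{-1}K_{\mathbf{x}}(x_{n+1})\|_{\ell^1(\mathbb{N}_n)} + \left|\frac{q}{p}\right| \ge \|K[\mathbf{x}]^{-1}\mathbf{y}\|_{\ell^1(\mathbb{N}_n)} = \|f\|_{\mathcal{B}},$$

which implies

$$\min_{f\in\mathcal{I}_{\mathbf{x}}(\mathbf{y})\cap\mathcal{S}^{\bar{\mathbf{x}}}} \|f\|_{\mathcal{B}} \ge \min_{f\in\mathcal{I}_{\mathbf{x}}(\mathbf{y})\cap\mathcal{S}^{\mathbf{x}}} \|f\|_{\mathcal{B}}.$$

Since $\mathcal{S}^{\mathbf{x}} \subseteq \mathcal{S}^{\bar{\mathbf{x}}}$,

$$\min_{f\in\mathcal{I}_{\mathbf{x}}(\mathbf{y})\cap\mathcal{S}^{\bar{\mathbf{x}}}} \|f\|_{\mathcal{B}} \le \min_{f\in\mathcal{I}_{\mathbf{x}}(\mathbf{y})\cap\mathcal{S}^{\mathbf{x}}} \|f\|_{\mathcal{B}}.$$

Thus, (4.9) holds true.

On the other hand, if (4.9) is always true for all $\mathbf{y} \in \mathbb{C}^n$ then we must have

$$\|K[\bar{\mathbf{x}}]^{-1}\bar{\mathbf{y}}\|_{\ell^1(\mathbb{N}_{n+1})} \ge \|K[\mathbf{x}]^{-1}\mathbf{y}\|_{\ell^1(\mathbb{N}_n)} \quad \text{for all } \mathbf{y} \in \mathbb{C}^n \text{ and } b \in \mathbb{C}.$$

In particular, the choices $\mathbf{y} = K_{\mathbf{x}}(x_{n+1})$ and $b = K^{\mathbf{x}}(x_{n+1})K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x_{n+1}) + p$ yield that

$$\|K[\bar{\mathbf{x}}]^{-1}\bar{\mathbf{y}}\|_{\ell^1(\mathbb{N}_{n+1})} = 1 \quad \text{and} \quad \|K[\mathbf{x}]^{-1}\mathbf{y}\|_{\ell^1(\mathbb{N}_n)} = \|(K[\mathbf{x}])^{-1}K_{\mathbf{x}}(x_{n+1})\|_{\ell^1(\mathbb{N}_n)}.$$

Combining the above two equations proves (1.2). The proof is complete. □

We are now ready to present one of the main results in this paper.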
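
The block formula used in the proof of Lemma 4.7 can be checked numerically. The following is a minimal sketch, assuming the exponential kernel of Section 5 and hypothetical helper names (`kernel_matrix`, `extended_coefficients`) that do not appear in the paper; it compares the closed-form coefficients with a direct solve of the extended system.

```python
import numpy as np

def exp_kernel(s, t):
    # Exponential kernel K(s, t) = exp(-|s - t|), used here only as a test case.
    return np.exp(-np.abs(s - t))

def kernel_matrix(kernel, pts):
    """K[x] with entries (K[x])_{jk} = K(x_k, x_j)."""
    pts = np.asarray(pts, dtype=float)
    return kernel(pts[None, :], pts[:, None])

def extended_coefficients(kernel, pts, new_pt, y, b):
    """Coefficients of the interpolant in S^{x-bar} via the block formula."""
    pts = np.asarray(pts, dtype=float)
    K = kernel_matrix(kernel, pts)
    col = kernel(new_pt, pts)          # K_x(x_{n+1}): entries K(x_{n+1}, x_j)
    row = kernel(pts, new_pt)          # K^x(x_{n+1}): entries K(x_j, x_{n+1})
    Kinv_y = np.linalg.solve(K, y)
    Kinv_col = np.linalg.solve(K, col)
    p = kernel(new_pt, new_pt) - row @ Kinv_col
    q = row @ Kinv_y - b
    return np.append(Kinv_y + (q / p) * Kinv_col, -q / p)

pts = np.array([-0.5, 0.1, 0.7])       # sampling points x (illustrative values)
y = np.array([1.0, -2.0, 0.5])         # data y
new_pt, b = 0.3, 0.4                   # extra point x_{n+1} and value b = g(x_{n+1})

coef_formula = extended_coefficients(exp_kernel, pts, new_pt, y, b)
coef_direct = np.linalg.solve(kernel_matrix(exp_kernel, np.append(pts, new_pt)),
                              np.append(y, b))
norm_small = np.abs(np.linalg.solve(kernel_matrix(exp_kernel, pts), y)).sum()
norm_large = np.abs(coef_formula).sum()
```

Because the exponential kernel satisfies (1.2), `norm_large` should not fall below `norm_small` for any choice of `b`, in line with the sufficiency part of the lemma.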



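Theorem 4.8. Every minimal norm interpolant (4.4) in $\mathcal{B}$ satisfies the linear representer theorem if and only if (1.2) holds true for all $n \in \mathbb{N}$ and all pairwise distinct sampling points $x_j \in X$, $j \in \mathbb{N}_{n+1}$.

Proof. The minimal norm interpolant (4.4) satisfies the linear representer theorem if and only if

$$\min_{g\in\mathcal{I}_{\mathbf{x}}(\mathbf{y})} \|g\|_{\mathcal{B}} = \min_{f\in\mathcal{I}_{\mathbf{x}}(\mathbf{y})\cap\mathcal{S}^{\mathbf{x}}} \|f\|_{\mathcal{B}}.$$

Therefore, if the above equation holds true then since $\mathcal{I}_{\mathbf{x}}(\mathbf{y}) \cap \mathcal{S}^{\mathbf{x}} \subseteq \mathcal{I}_{\mathbf{x}}(\mathbf{y}) \cap \mathcal{S}^{\bar{\mathbf{x}}} \subseteq \mathcal{I}_{\mathbf{x}}(\mathbf{y})$, we obtain (4.9). By Lemma 4.7, (1.2) is true for every $x_{n+1} \in X$.

It remains to prove the sufficiency. We shall first show $\|g\|_{\mathcal{B}} \ge \min_{f\in\mathcal{I}_{\mathbf{x}}(\mathbf{y})\cap\mathcal{S}^{\mathbf{x}}} \|f\|_{\mathcal{B}}$ for all $g \in \mathcal{I}_{\mathbf{x}}(\mathbf{y}) \cap \mathcal{B}_0$. To this end, we express $g$ as $g = \sum_{j=1}^{m} c_j K(x_j, \cdot)$ for some $m \ge n$ and pairwise distinct $\{x_j : j \in \mathbb{N}_m\} \subseteq X$. This can always be done by adding some sampling points, setting the corresponding coefficients to be zero, and relabeling if necessary. We let $y_j := g(x_j)$, $j \in \mathbb{N}_m$, $\mathbf{u}_l := (y_j : j \in \mathbb{N}_l)$, and $\mathbf{v}_l := \{x_j : j \in \mathbb{N}_l\}$ for $1 \le l \le m$. Note that $\mathbf{y} = \mathbf{u}_n$ and $\mathbf{x} = \mathbf{v}_n$. It follows that $g \in \mathcal{I}_{\mathbf{v}_m}(\mathbf{u}_m) \cap \mathcal{S}^{\mathbf{v}_m}$ and thus,

$$\|g\|_{\mathcal{B}} \ge \min_{f\in\mathcal{I}_{\mathbf{v}_m}(\mathbf{u}_m)\cap\mathcal{S}^{\mathbf{v}_m}} \|f\|_{\mathcal{B}}.$$

Since $\mathcal{I}_{\mathbf{v}_m}(\mathbf{u}_m) \subseteq \mathcal{I}_{\mathbf{v}_{m-1}}(\mathbf{u}_{m-1})$, we apply Lemma 4.7 to get

$$\min_{f\in\mathcal{I}_{\mathbf{v}_m}(\mathbf{u}_m)\cap\mathcal{S}^{\mathbf{v}_m}} \|f\|_{\mathcal{B}} \ge \min_{f\in\mathcal{I}_{\mathbf{v}_{m-1}}(\mathbf{u}_{m-1})\cap\mathcal{S}^{\mathbf{v}_m}} \|f\|_{\mathcal{B}} = \min_{f\in\mathcal{I}_{\mathbf{v}_{m-1}}(\mathbf{u}_{m-1})\cap\mathcal{S}^{\mathbf{v}_{m-1}}} \|f\|_{\mathcal{B}}.$$

It follows that

$$\|g\|_{\mathcal{B}} \ge \min_{f\in\mathcal{I}_{\mathbf{v}_{m-1}}(\mathbf{u}_{m-1})\cap\mathcal{S}^{\mathbf{v}_{m-1}}} \|f\|_{\mathcal{B}}.$$

Repeating this process, we reach

$$\|g\|_{\mathcal{B}} \ge \min_{f\in\mathcal{I}_{\mathbf{v}_n}(\mathbf{u}_n)\cap\mathcal{S}^{\mathbf{v}_n}} \|f\|_{\mathcal{B}} = \min_{f\in\mathcal{I}_{\mathbf{x}}(\mathbf{y})\cap\mathcal{S}^{\mathbf{x}}} \|f\|_{\mathcal{B}} \quad \text{for all } g \in \mathcal{I}_{\mathbf{x}}(\mathbf{y}) \cap \mathcal{B}_0. \tag{4.10}$$

Now let $g \in \mathcal{I}_{\mathbf{x}}(\mathbf{y})$ be arbitrary but fixed. Then there exists a sequence of functions $\{g_j \in \mathcal{B}_0 : j \in \mathbb{N}\}$ that converges to $g$ in $\mathcal{B}$. We let $f$ and $f_j$ be the functions in $\mathcal{S}^{\mathbf{x}}$ such that $f(\mathbf{x}) = \mathbf{y}$ and $f_j(\mathbf{x}) = g_j(\mathbf{x})$, $j \in \mathbb{N}$. They are explicitly given by

$$f = K^{\mathbf{x}}(\cdot)K[\mathbf{x}]^{-1}g(\mathbf{x}) \quad \text{and} \quad f_j = K^{\mathbf{x}}(\cdot)K[\mathbf{x}]^{-1}g_j(\mathbf{x}), \quad j \in \mathbb{N}.$$

Since $g_j$ converges to $g$ in $\mathcal{B}$ and point evaluation functionals are continuous on $\mathcal{B}$, $g_j(\mathbf{x}) \to g(\mathbf{x})$ as $j \to \infty$. As a result, $\lim_{j\to\infty} \|f - f_j\|_{\mathcal{B}} = 0$. By (4.10), $\|g_j\|_{\mathcal{B}} \ge \|f_j\|_{\mathcal{B}}$ for all $j \in \mathbb{N}$. We hence obtain that $\|g\|_{\mathcal{B}} \ge \|f\|_{\mathcal{B}}$. Therefore,

$$\min_{g\in\mathcal{I}_{\mathbf{x}}(\mathbf{y})} \|g\|_{\mathcal{B}} \ge \min_{f\in\mathcal{I}_{\mathbf{x}}(\mathbf{y})\cap\mathcal{S}^{\mathbf{x}}} \|f\|_{\mathcal{B}}.$$

The reverse direction of the inequality is clear as $\mathcal{I}_{\mathbf{x}}(\mathbf{y}) \cap \mathcal{S}^{\mathbf{x}} \subseteq \mathcal{I}_{\mathbf{x}}(\mathbf{y})$. □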

We draw the following conclusion by Proposition 4.6 and Theorem 4.8.

Corollary 4.9. Every acceptable regularized learning scheme of the form (4.1) has a minimizer of the form (4.3) if and only if the function $K$ satisfies the property (1.2).

In the last part of the section, we briefly discuss the linear representer theorem in $\mathcal{B}^\sharp$ under the same assumption that $K$ is bounded and satisfies (A3). By Theorem 3.4, $\mathcal{B}^\sharp$ is an RKBS on $X$. Likewise, we call a regularized learning scheme

$$f_0 = \arg\min_{f\in\mathcal{B}^\sharp} V(f(\mathbf{x})) + \mu\,\phi(\|f\|_{\mathcal{B}^\sharp}) \tag{4.11}$$

acceptable if $V$ and $\phi$ are continuous and (4.2) is satisfied by $\phi$. The space $\mathcal{B}^\sharp$ is said to satisfy the linear representer theorem if every acceptable learning scheme (4.11) has a minimizer of the following form

$$f_0 = \sum_{j=1}^{n} c_j K(\cdot, x_j), \tag{4.12}$$

where the $c_j$'s are constants. We follow similar approaches to those used for $\mathcal{B}$ to study this important property on $\mathcal{B}^\sharp$.

Proposition 4.10. Let $\mathbf{x} \subseteq X$ have pairwise distinct elements. Every acceptable regularized learning scheme (4.11) in $\mathcal{B}^\sharp$ has a minimizer $f_0$ lying in $\mathcal{S}_{\mathbf{x}} := \operatorname{span}\{K(\cdot, x_j) : j \in \mathbb{N}_n\}$ if and only if there is a minimal norm interpolant,

$$f_{\min} := \arg\min_{f\in\mathcal{B}^\sharp,\ f(\mathbf{x})=\mathbf{y}} \|f\|_{\mathcal{B}^\sharp}, \tag{4.13}$$

lying in $\mathcal{S}_{\mathbf{x}}$ for all $\mathbf{y} \in \mathbb{C}^n$.

Proof. The arguments of the proof are similar to those for $\mathcal{B}$. One only needs to note that although the norm of a function in $\mathcal{B}^\sharp$ may not be known, any two norms on the finite dimensional vector space $\mathcal{S}_{\mathbf{x}}$ are equivalent. □

To study conditions ensuring that the minimal norm interpolation (4.13) satisfies the linear representer theorem, we first identify a specific form of the norm $\|\cdot\|_{\mathcal{B}^\sharp}$ under the assumption that $K$ satisfies (1.2). Notice that a function $f_{\mathbf{c}} = \sum_{j=1}^{n} c_j K(\cdot, x_j) \in \mathcal{S}_{\mathbf{x}} \subseteq \mathcal{B}^\sharp_0$ can be represented as $f_{\mathbf{c}} = \mathbf{c}^T K_{\mathbf{x}}(\cdot)$.



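Lemma 4.11. Let $\mathbf{x}$ have pairwise distinct elements. The function $K$ satisfies (1.2) if and only if

$$\|f_{\mathbf{c}}\|_{\mathcal{B}^\sharp} = \|\mathbf{c}^T K[\mathbf{x}]\|_\infty \quad \text{for all } f_{\mathbf{c}} = \mathbf{c}^T K_{\mathbf{x}}(\cdot),\ \mathbf{c} \in \mathbb{C}^n, \tag{4.14}$$

where $\|\cdot\|_\infty$ denotes the maximum norm on $\mathbb{C}^n$.

Proof. Suppose that $K$ satisfies (1.2) for all $x_{n+1} \in X \setminus \mathbf{x}$. Then we have for all $x \in X$ that $\|K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x)\|_{\ell^1(\mathbb{N}_n)} \le 1$. Let $\mathbf{c} \in \mathbb{C}^n$ and $x \in X$. It follows from this inequality that

$$|\mathbf{c}^T K_{\mathbf{x}}(x)| = |\mathbf{c}^T K[\mathbf{x}]K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x)| \le \|\mathbf{c}^T K[\mathbf{x}]\|_\infty \|K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x)\|_{\ell^1(\mathbb{N}_n)} \le \|\mathbf{c}^T K[\mathbf{x}]\|_\infty,$$

which implies by Lemma 3.2 that for $f_{\mathbf{c}} = \mathbf{c}^T K_{\mathbf{x}}(\cdot)$

$$\|f_{\mathbf{c}}\|_{\mathcal{B}^\sharp} = \|\mathbf{c}^T K_{\mathbf{x}}(\cdot)\|_{L^\infty(X)} \le \|\mathbf{c}^T K[\mathbf{x}]\|_\infty.$$

The other direction of the inequality is clear as we have

$$\|\mathbf{c}^T K[\mathbf{x}]\|_\infty = \max\{|\mathbf{c}^T K_{\mathbf{x}}(x_j)| : j \in \mathbb{N}_n\} \le \|\mathbf{c}^T K_{\mathbf{x}}(\cdot)\|_{L^\infty(X)} = \|f_{\mathbf{c}}\|_{\mathcal{B}^\sharp}.$$

It remains to show that (4.14) implies (1.2). We prove this by construction. For any $x_{n+1} \in X$, we can find a nonzero vector $\mathbf{c} \in \mathbb{C}^n$ such that

$$|\mathbf{c}^T K_{\mathbf{x}}(x_{n+1})| = |\mathbf{c}^T K[\mathbf{x}]K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x_{n+1})| = \|\mathbf{c}^T K[\mathbf{x}]\|_\infty \|K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x_{n+1})\|_{\ell^1(\mathbb{N}_n)}.$$

We then let $f_{\mathbf{c}} = \mathbf{c}^T K_{\mathbf{x}}(\cdot)$ and obtain by (4.14)

$$\|\mathbf{c}^T K[\mathbf{x}]\|_\infty \|K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x_{n+1})\|_{\ell^1(\mathbb{N}_n)} = |f_{\mathbf{c}}(x_{n+1})| \le \|f_{\mathbf{c}}\|_{L^\infty(X)} = \|f_{\mathbf{c}}\|_{\mathcal{B}^\sharp} = \|\mathbf{c}^T K[\mathbf{x}]\|_\infty,$$

which implies (1.2) as $\mathbf{c}^T K[\mathbf{x}]$ is not the zero vector. The proof is complete. □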
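
As an informal illustration of (4.14), the supremum norm of $f_{\mathbf{c}}$ can be compared with the largest absolute value it takes at the sampling points. The sketch below assumes the exponential kernel (which satisfies (1.2) by Proposition 5.2), replaces the supremum over $X = \mathbb{R}$ by a maximum over a finite grid, and uses arbitrary illustrative points and coefficients.

```python
import numpy as np

def exp_kernel(s, t):
    # Exponential kernel K(s, t) = exp(-|s - t|).
    return np.exp(-np.abs(s - t))

pts = np.array([-1.0, -0.2, 0.5, 1.3])          # sampling points x (illustrative)
rng = np.random.default_rng(0)
c = rng.standard_normal(pts.size)                # coefficient vector c (illustrative)

K = exp_kernel(pts[None, :], pts[:, None])       # (K[x])_{jk} = K(x_k, x_j)
max_at_samples = np.max(np.abs(c @ K))           # ||c^T K[x]||_inf

# Grid approximation of the sup norm; the sampling points are included explicitly.
grid = np.union1d(np.linspace(-6.0, 6.0, 2401), pts)
f_c = exp_kernel(grid[:, None], pts[None, :]) @ c   # f_c(x) = sum_j c_j K(x, x_j)
sup_on_grid = np.max(np.abs(f_c))
```

On such a grid the two quantities should agree up to rounding, since the maximum of $|f_{\mathbf{c}}|$ is attained at one of the sampling points. A kernel violating (1.2) would in general give `sup_on_grid` strictly larger than `max_at_samples` for some choice of `c`.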
We now show that (1.2) is sufficient for $\mathcal{B}^\sharp$ to satisfy the linear representer theorem.

Theorem 4.12. If $K$ satisfies (1.2) then $\mathcal{B}^\sharp$ satisfies the linear representer theorem.

Proof. Suppose that (1.2) holds true. By Proposition 4.10, it suffices to show that the minimal norm interpolation (4.13) has a minimizer of the form (4.12). We shall prove this by directly showing that $f_0 = \mathbf{y}^T K[\mathbf{x}]^{-1}K_{\mathbf{x}}(\cdot)$ is a minimizer for (4.13). Let $f$ be an arbitrary function in $\mathcal{B}^\sharp$ such that $f(\mathbf{x}) = \mathbf{y}$. Then we have by Lemma 3.2

$$\|f\|_{\mathcal{B}^\sharp} = \|f\|_{L^\infty(X)} \ge \|f(\mathbf{x})\|_\infty = \|\mathbf{y}\|_\infty.$$

By Lemma 4.11,

$$\|f_0\|_{\mathcal{B}^\sharp} = \|\mathbf{y}^T K[\mathbf{x}]^{-1}K[\mathbf{x}]\|_\infty = \|\mathbf{y}\|_\infty.$$

Combining the above two relations leads to $\|f_0\|_{\mathcal{B}^\sharp} \le \|f\|_{\mathcal{B}^\sharp}$. Therefore, (4.13) has the minimizer $f_0 = \mathbf{y}^T K[\mathbf{x}]^{-1}K_{\mathbf{x}}(\cdot)$, which has the form (4.12). □



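In the particular case when $X$ has a finite cardinality, we shall show that condition (1.2) is also necessary for $\mathcal{B}^\sharp$ to satisfy the linear representer theorem.

Proposition 4.13. If $X$ consists of finitely many points and $\mathcal{B}^\sharp$ satisfies the linear representer theorem then (1.2) holds true.

Proof. Let $\mathbf{c} \in \mathbb{C}^n$ and $f_{\mathbf{c}} = \mathbf{c}^T K_{\mathbf{x}}(\cdot)$. Under the assumptions, we get by Proposition 4.10 that $f_{\mathbf{c}}$ is a minimizer for the minimal norm interpolation (4.13) with $\mathbf{y} = f_{\mathbf{c}}(\mathbf{x}) = (K[\mathbf{x}])^T\mathbf{c}$. Since $X$ has a finite cardinality and $K[\mathbf{x}]$ is nonsingular for all pairwise distinct $\mathbf{x} \subseteq X$, we can find a function $g \in \mathcal{B}^\sharp_0$ such that $g(\mathbf{x}) = \mathbf{y}$ and $\|g\|_{L^\infty(X)} \le \|\mathbf{y}\|_\infty$. Since $f_{\mathbf{c}}$ is a minimizer of (4.13) and $g$ satisfies $g(\mathbf{x}) = \mathbf{y}$,

$$\|f_{\mathbf{c}}\|_{\mathcal{B}^\sharp} \le \|g\|_{\mathcal{B}^\sharp} = \|g\|_{L^\infty(X)} \le \|\mathbf{y}\|_\infty = \|(K[\mathbf{x}])^T\mathbf{c}\|_\infty.$$

On the other hand, we have by Lemma 3.2

$$\|f_{\mathbf{c}}\|_{\mathcal{B}^\sharp} = \|f_{\mathbf{c}}\|_{L^\infty(X)} \ge \|f_{\mathbf{c}}(\mathbf{x})\|_\infty = \|(K[\mathbf{x}])^T\mathbf{c}\|_\infty.$$

By the above two equations, (4.14) holds true. By Lemma 4.11, $K$ satisfies (1.2). □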



One observes that the key ingredient in the proof of Proposition 4.13 is to extend a function on the discrete set $\mathbf{x}$ to a function in $\mathcal{B}^\sharp$ in a way that the supremum norm is preserved. In many cases, this is achievable without $X$ being a finite set. For instance, by the Tietze extension theorem in topology, such an extension exists when $X$ is a compact metric space and $K$ is a universal kernel [^] on $X$. Thus, for those input spaces $X$ and functions $K$, $\mathcal{B}^\sharp$ satisfies the linear representer theorem if and only if (1.2) holds true.



5 Examples of Admissible Kernels

Recall the definition of admissible kernels from the introduction. Note that the first requirement (A1) in the definition implies (3.1). Theorem 1.2 is proved by combining Theorem 3.4 and Corollary 4.9. By this result, admissible kernels are crucial for our construction. Functions $K$ satisfying requirements (A1)-(A3) are usually relatively easy to find. Some examples have been presented before Proposition 3.7 in Section 3. However, requirement (A4) could be somewhat demanding and rule out many commonly used kernels. We are able to present two examples of admissible kernels below.

The first example is the Brownian bridge kernel that arises in the study of the Brownian bridge stochastic process in statistics.

Proposition 5.1. The Brownian bridge kernel defined by

$$K(s,t) := \min\{s,t\} - st, \quad s,t \in (0,1)$$

is an admissible kernel on the input space $X = (0,1)$.

Proof. We start with validating requirement (A4). Let $0 < x_1 < x_2 < \cdots < x_n < 1$ be given and $x \in (0,1)$ be different from $x_j$, $j \in \mathbb{N}_n$. Direct computations show that

1. If $x < x_1$ then $K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x) = \left(\frac{x}{x_1}, 0, \ldots, 0\right)^T$.

2. If $x > x_n$ then $K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x) = \left(0, \ldots, 0, \frac{1-x}{1-x_n}\right)^T$.

3. If $x_j < x < x_{j+1}$ for some $j \in \mathbb{N}_{n-1}$ then

$$K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x) = \left(0, \ldots, 0, \frac{x_{j+1}-x}{x_{j+1}-x_j}, \frac{x-x_j}{x_{j+1}-x_j}, 0, \ldots, 0\right)^T.$$

In all cases, it is straightforward to see $\|K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x)\|_{\ell^1(\mathbb{N}_n)} \le 1$. Therefore, requirement (A4) is indeed fulfilled.

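To verify the other three requirements, we first observe

$$K(s,t) = \int_0^1 \Gamma_s(z)\Gamma_t(z)\,dz, \quad s,t \in (0,1),$$

where $\Gamma_x := \chi_{(0,x)} - x$ with $\chi_A$ standing for the characteristic function of $A \subseteq (0,1)$. Suppose that $K[\mathbf{x}]\mathbf{c} = 0$ for some $\mathbf{c} \in \mathbb{C}^n$. Then we have

$$\int_0^1 \Big|\sum_{j=1}^{n} c_j \Gamma_{x_j}(z)\Big|^2 dz = \mathbf{c}^* K[\mathbf{x}]\mathbf{c} = 0,$$

which implies that

$$\sum_{j=1}^{n} c_j \Gamma_{x_j}(z) = 0 \quad \text{for almost every } z \in [0,1].$$

Clearly, $\Gamma_{x_j}$, $j \in \mathbb{N}_n$, are linearly independent. Therefore, $c_j = 0$ for all $j \in \mathbb{N}_n$. Requirement (A1) is hence satisfied.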
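
For completeness, the integral representation of the Brownian bridge kernel used above can be verified in one line; the computation below relies only on the definition $\Gamma_x = \chi_{(0,x)} - x$ and on $\int_0^1 \chi_{(0,s)}\chi_{(0,t)}\,dz = \min\{s,t\}$.

```latex
\begin{align*}
\int_0^1 \Gamma_s(z)\Gamma_t(z)\,dz
  &= \int_0^1 \chi_{(0,s)}(z)\chi_{(0,t)}(z)\,dz
     - t\int_0^1 \chi_{(0,s)}(z)\,dz
     - s\int_0^1 \chi_{(0,t)}(z)\,dz
     + st\int_0^1 dz \\
  &= \min\{s,t\} - ts - st + st \\
  &= \min\{s,t\} - st = K(s,t).
\end{align*}
```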

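The function $K$ is clearly bounded by 1. Suppose that for some $\mathbf{c} \in \ell^1(\mathbb{N})$ and pairwise distinct $x_j \in (0,1)$, $j \in \mathbb{N}$,

$$\sum_{j=1}^{\infty} c_j K(x_j, x) = \int_0^1 \Big(\sum_{j=1}^{\infty} c_j \Gamma_{x_j}(z)\Big)\Gamma_x(z)\,dz = 0 \quad \text{for all } x \in (0,1).$$

It implies that the function $\phi := \sum_{j=1}^{\infty} c_j \Gamma_{x_j}$ is orthogonal to $\Gamma_x$ for all $x \in (0,1)$, that is,

$$\int_0^x \phi(t)\,dt - x\int_0^1 \phi(t)\,dt = 0 \quad \text{for all } x \in (0,1).$$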
Jo Jo 

Taking the derivative on both sides of the above equation yields that $\phi$ equals a constant $C$ almost everywhere on $[0,1]$. Namely,

$$\sum_{j=1}^{\infty} c_j \chi_{[0,x_j]} - \sum_{j=1}^{\infty} c_j x_j = C \quad \text{almost everywhere.}$$

We now take the derivative of both sides of the equation above in the distributional sense to get $\sum_{j=1}^{\infty} c_j \delta_{x_j} = 0$. Let $j$ be an arbitrary but fixed positive integer. We can find a sequence of infinitely continuously differentiable functions $\phi_k$, $k \in \mathbb{N}$, such that $\|\phi_k\|_{L^\infty([0,1])} \le 1$, $\phi_k(x_j) = 1$, and the Lebesgue measure of the set where $\phi_k$ is nonzero is less than or equal to $1/k$. For each $N \in \mathbb{N}$, we have for sufficiently large $k$ that

$$\phi_k(x_l) = 0 \quad \text{for all } l \in \mathbb{N}_N \setminus \{j\}.$$

We get for this $\phi_k$

$$0 = \Big|\Big(\sum_{l=1}^{\infty} c_l \delta_{x_l}\Big)(\phi_k)\Big| \ge |c_j| - \sum_{l>N} |c_l|.$$

Since $\sum_{l>N} |c_l|$ converges to zero as $N \to \infty$, we have $c_j = 0$. Therefore, $\mathbf{c} = 0$ as $j$ is arbitrarily chosen.

We conclude that all the four requirements of an admissible kernel are fulfilled by the Brownian bridge kernel. □

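The second example is the exponential kernel (also called the Matérn kernel).

Proposition 5.2. The exponential kernel

$$K(s,t) := e^{-|s-t|}, \quad s,t \in \mathbb{R} \tag{5.1}$$

is an admissible kernel on $\mathbb{R}$.

Proof. We have seen in Section 3 that this kernel satisfies requirements (A1)-(A3). It remains to check requirement (A4). Let $x_1 < x_2 < \cdots < x_n$ be given and $x \in \mathbb{R}$ be different from $x_j$, $j \in \mathbb{N}_n$. Direct computations show that

1. If $x < x_1$ then $K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x) = (e^{x-x_1}, 0, \ldots, 0)^T$.

2. If $x > x_n$ then $K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x) = (0, \ldots, 0, e^{x_n-x})^T$.

3. If $x_j < x < x_{j+1}$ for some $j \in \mathbb{N}_{n-1}$ then

$$K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x) = \left(0, \ldots, 0, \frac{e^{x_{j+1}-x} - e^{x-x_{j+1}}}{e^{x_{j+1}-x_j} - e^{x_j-x_{j+1}}}, \frac{e^{x-x_j} - e^{x_j-x}}{e^{x_{j+1}-x_j} - e^{x_j-x_{j+1}}}, 0, \ldots, 0\right)^T.$$

In all cases, $\|K[\mathbf{x}]^{-1}K_{\mathbf{x}}(x)\|_{\ell^1(\mathbb{N}_n)} \le 1$. The proof is complete. □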
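
The closed-form vectors in the proofs of Propositions 5.1 and 5.2 are easy to confirm numerically. The following is a minimal sketch with illustrative sampling points; the helper name `a4_vector` is hypothetical and simply solves $K[\mathbf{x}]\mathbf{v} = K_{\mathbf{x}}(x)$.

```python
import numpy as np

def brownian_bridge(s, t):
    # K(s, t) = min{s, t} - s t on (0, 1).
    return np.minimum(s, t) - s * t

def exponential(s, t):
    # K(s, t) = exp(-|s - t|) on the real line.
    return np.exp(-np.abs(s - t))

def a4_vector(kernel, pts, x):
    """Return K[x]^{-1} K_x(x) for sampling points pts and a new point x."""
    pts = np.asarray(pts, dtype=float)
    K = kernel(pts[None, :], pts[:, None])       # (K[x])_{jk} = K(x_k, x_j)
    return np.linalg.solve(K, kernel(x, pts))    # K_x(x) has entries K(x, x_j)

pts = np.array([0.2, 0.5, 0.9])   # illustrative sampling points
x_new = 0.6                       # lies between x_2 = 0.5 and x_3 = 0.9

v_bb = a4_vector(brownian_bridge, pts, x_new)
v_exp = a4_vector(exponential, pts, x_new)

# Closed forms from case 3 of the two proofs (sinh form of the exponential case).
v_bb_closed = np.array([0.0, (0.9 - x_new) / 0.4, (x_new - 0.5) / 0.4])
v_exp_closed = np.array([0.0, np.sinh(0.9 - x_new), np.sinh(x_new - 0.5)]) / np.sinh(0.4)
v_exp_closed[0] = 0.0
```

In both cases only the two neighbours of the new point carry nonzero weight, and the weights are nonnegative with sum at most 1, which is exactly what (A4) asks for.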
Finally, we remark that by numerical experiments, the Gaussian kernel

$$K(s,t) = \exp\left(-(s-t)^2\right), \quad s,t \in \mathbb{R},$$

does not satisfy (A4). Consequently, neither does the Gaussian kernel (3.4) on $\mathbb{R}^d$. The same situation happens to the inverse multiquadric (3.5) when $\beta = 1/2$.
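
A two-point configuration already exhibits the failure of (A4) for a Gaussian kernel. The sketch below assumes the unit-width form $\exp(-(s-t)^2)$, which may differ from the normalization used in the source, and contrasts it with the exponential kernel on the same points; the helper `a4_norm` is hypothetical.

```python
import numpy as np

def gaussian(s, t):
    # Assumed unit-width Gaussian; the source's scaling may differ.
    return np.exp(-(s - t) ** 2)

def exponential(s, t):
    return np.exp(-np.abs(s - t))

def a4_norm(kernel, pts, x):
    """l^1 norm of K[x]^{-1} K_x(x)."""
    pts = np.asarray(pts, dtype=float)
    K = kernel(pts[None, :], pts[:, None])       # (K[x])_{jk} = K(x_k, x_j)
    return np.abs(np.linalg.solve(K, kernel(x, pts))).sum()

# Sampling points {0, 1} and the midpoint 0.5 as the extra point.
gauss_norm = a4_norm(gaussian, [0.0, 1.0], 0.5)
exp_norm = a4_norm(exponential, [0.0, 1.0], 0.5)
```

By symmetry both weights are equal here, so the norms have the closed forms $2e^{-1/4}/(1+e^{-1}) \approx 1.14$ for the Gaussian and $2e^{-1/2}/(1+e^{-1}) \approx 0.89$ for the exponential kernel; the former exceeds 1 and therefore violates (A4) for this choice of points.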

6 Relaxation of the Admissible Condition (A4)

As seen above, the admissible condition (A4) is satisfied by few commonly used kernels. This section aims at weakening this requirement to accommodate more kernels. We are very grateful to the anonymous referee for a useful remark that inspired the approach below.

Let $K$ be a function on $X \times X$ that satisfies (A1)-(A3) and let $\mathcal{B}$ be constructed by (1.3). The condition (A4) is meant to ensure the validity of the linear representer theorem for regularized learning in $\mathcal{B}$. To see how it can be relaxed, we first examine the role of the linear representer theorem in the learning rate estimate. Consider the $\ell^1$ norm coefficient-based regularization algorithm

$$\min_{\mathbf{c}\in\mathbb{C}^n} \frac{1}{n}\sum_{j=1}^{n} |K^{\mathbf{x}}(x_j)\mathbf{c} - y_j|^2 + \mu\|\mathbf{c}\|_{\ell^1(\mathbb{N}_n)}, \tag{6.1}$$

where $\mathbf{x} := \{x_j : j \in \mathbb{N}_n\}$ is a sequence of sampling points from the input space $X$, $y_j \in Y \subseteq \mathbb{C}$ is the observed output on $x_j$, and $\mu$ is a positive regularization parameter. Following a commonly used assumption in machine learning, we assume that the sample data $\mathbf{z} := \{(x_j, y_j) : j \in \mathbb{N}_n\} \subseteq X \times Y$ is formed by independent and identically distributed instances of a random variable $(x,y) \in X \times Y$ subject to an unknown probability measure $\rho$ on $X \times Y$. Let $\mathbf{c}_{\mathbf{z},\mu}$ be a minimizer of (6.1). We hope that the obtained function

$$f_{\mathbf{z},\mu}(x) := K^{\mathbf{x}}(x)\mathbf{c}_{\mathbf{z},\mu}, \quad x \in X \tag{6.2}$$
will well predict the outputs of new inputs from $X$. The performance of a general predictor $f : X \to Y$ is usually measured by

$$\mathcal{E}(f) := \int_{X\times Y} |f(x) - y|^2\,d\rho.$$

The predictor that minimizes the above error is the regression function

$$f_\rho(x) := \int_Y y\,d\rho(y|x), \quad x \in X,$$

where $\rho(y|x)$ denotes the conditional probability measure of $y$ with respect to $x$. This optimal predictor $f_\rho$ is unreachable as $\rho$ is unknown. We shall approximate $f_\rho$ with $f_{\mathbf{z},\mu}$. More precisely, we expect with a large confidence that the approximation error $\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho)$ would converge to zero fast as the number of sampling points increases.

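A standard approach Q in estimating the error $\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho)$ is to bound it by the sum of the sampling error, the hypothesis error and the regularization error. Let $g$ be an arbitrary function from $\mathcal{B}$ and set for each function $f : X \to \mathbb{C}$

$$\mathcal{E}_{\mathbf{z}}(f) := \frac{1}{n}\sum_{j=1}^{n} |f(x_j) - y_j|^2.$$

The approximation error $\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho)$ can then be decomposed into the sum of four quantities

$$\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho) = \mathcal{S}(\mathbf{z},\mu,g) + \mathcal{P}(\mathbf{z},\mu,g) + \mathcal{D}(\mu,g) - \mu\|f_{\mathbf{z},\mu}\|_{\mathcal{B}},$$

where the sampling error, the hypothesis error and the regularization error are respectively defined by

$$\mathcal{S}(\mathbf{z},\mu,g) := \mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\mu}) + \mathcal{E}_{\mathbf{z}}(g) - \mathcal{E}(g),$$
$$\mathcal{P}(\mathbf{z},\mu,g) := \left(\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\mu}) + \mu\|f_{\mathbf{z},\mu}\|_{\mathcal{B}}\right) - \left(\mathcal{E}_{\mathbf{z}}(g) + \mu\|g\|_{\mathcal{B}}\right),$$
$$\mathcal{D}(\mu,g) := \mathcal{E}(g) - \mathcal{E}(f_\rho) + \mu\|g\|_{\mathcal{B}}.$$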
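
As a quick check, the decomposition above follows by summing the three definitions directly; the empirical terms and the terms involving $g$ cancel in pairs. The derivation below uses only the quantities already defined.

```latex
\begin{align*}
\mathcal{S} + \mathcal{P} + \mathcal{D} - \mu\|f_{\mathbf{z},\mu}\|_{\mathcal{B}}
 &= \bigl[\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\mu})
        + \mathcal{E}_{\mathbf{z}}(g) - \mathcal{E}(g)\bigr] \\
 &\quad + \bigl[\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\mu}) + \mu\|f_{\mathbf{z},\mu}\|_{\mathcal{B}}
        - \mathcal{E}_{\mathbf{z}}(g) - \mu\|g\|_{\mathcal{B}}\bigr] \\
 &\quad + \bigl[\mathcal{E}(g) - \mathcal{E}(f_\rho) + \mu\|g\|_{\mathcal{B}}\bigr]
        - \mu\|f_{\mathbf{z},\mu}\|_{\mathcal{B}} \\
 &= \mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho).
\end{align*}
```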
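
Under the condition (A4), $\mathcal{B}$ satisfies the linear representer theorem. As a result,

$$\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\mu}) + \mu\|f_{\mathbf{z},\mu}\|_{\mathcal{B}} = \min_{f\in\mathcal{S}^{\mathbf{x}}} \mathcal{E}_{\mathbf{z}}(f) + \mu\|f\|_{\mathcal{B}} = \min_{f\in\mathcal{B}} \mathcal{E}_{\mathbf{z}}(f) + \mu\|f\|_{\mathcal{B}}. \tag{6.3}$$

Immediately, one has $\mathcal{P}(\mathbf{z},\mu,g) \le 0$, leading to the estimate

$$\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho) \le \mathcal{S}(\mathbf{z},\mu,g) + \mathcal{D}(\mu,g).$$

Starting from the above inequality, learning rates of $f_{\mathbf{z},\mu}$ can be obtained [25]. To weaken (A4), we should not stick to the linear representer theorem (6.3). Instead, we wish to replace it with the relaxed linear representer theorem

$$\min_{f\in\mathcal{S}^{\mathbf{x}}} \mathcal{E}_{\mathbf{z}}(f) + \mu\|f\|_{\mathcal{B}} \le \min_{f\in\mathcal{B}} \mathcal{E}_{\mathbf{z}}(f) + \mu\beta_n\|f\|_{\mathcal{B}}, \tag{6.4}$$

where $\beta_n$ is a constant depending on the number $n$ of sampling points, the kernel $K$ and the input space $X$. For simplicity, we suppress the notations $K$ and $X$ as they are fixed in our context. The approximation error $\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho)$ is accordingly factored as

$$\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho) = \mathcal{S}(\mathbf{z},\mu,g) + \widetilde{\mathcal{P}}(\mathbf{z},\mu,g) + \widetilde{\mathcal{D}}(\mu,g) - \mu\|f_{\mathbf{z},\mu}\|_{\mathcal{B}},$$

where $\widetilde{\mathcal{P}}$ is the hypothesis error with $\mu\|g\|_{\mathcal{B}}$ replaced by $\mu\beta_n\|g\|_{\mathcal{B}}$, that is, $\widetilde{\mathcal{P}}(\mathbf{z},\mu,g) := \left(\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z},\mu}) + \mu\|f_{\mathbf{z},\mu}\|_{\mathcal{B}}\right) - \left(\mathcal{E}_{\mathbf{z}}(g) + \mu\beta_n\|g\|_{\mathcal{B}}\right)$, and

$$\widetilde{\mathcal{D}}(\mu,g) := \mathcal{E}(g) - \mathcal{E}(f_\rho) + \mu\beta_n\|g\|_{\mathcal{B}}.$$

By (6.4), we keep the advantage that $\widetilde{\mathcal{P}}(\mathbf{z},\mu,g) \le 0$. Therefore,

$$\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho) \le \mathcal{S}(\mathbf{z},\mu,g) + \widetilde{\mathcal{D}}(\mu,g).$$

As long as $\beta_n$ does not increase too fast as $n$ increases, one is still able to obtain a learning rate competitive with those in [25, 30]. We shall omit the detailed arguments and assumptions on the kernel $K$, the regression function $f_\rho$ and the input space $X$, as they are similar to those in [25]. We present one result that for all $0 < \delta < 1$, there exists a constant $C_s$ such that with confidence $1 - \delta$, we have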



, , / , ^ 2s log 7 , , 2s-2 log -J , , 2s-l log -J + logfl + n) r, 1 

£{fz,p)-£{fp) < Cs (M/3n)^ + ^(/U/3„) W + ^(^f]^)TTT + S SI Ipln-T+e 



S 2s-2 log T 2s-l log; T 

where $s \in (0,1)$ represents the regularity of $f_\rho$ and $\theta > 0$ is a positive constant related to assumptions on the kernel $K$ and the input space $X$ [25]. Thus, as long as $\beta_n$ does not cancel the decay of the term $n^{-\frac{1}{1+\theta}}$, one still has the hope of getting a satisfactory learning rate when $\mu$ is appropriately chosen. We discuss two instances below:

(i) If $\beta_n$ is uniformly bounded with a large confidence then $\mathcal{E}(f_{\mathbf{z},\mu}) - \mathcal{E}(f_\rho)$ has the same learning rate as that established in [25], that is,


£Uz,p) - £{fp) < Csn~—— log (6.5) 



2 + 2n 
T 



(ii) If /3„ < Cn" for some positive constants C and a < ^^^e then 

£{fz,p) - £{fp) < C5n-TT27(TT«-2") log ^±^. (6.6) 
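
If we give up the linear representer theorem and pursue the relaxed version (6.4) instead, how can the admissible condition (A4) be weakened? We next answer this question.

Proposition 6.1. If there exists some $\beta_n \ge 1$ such that for all $\mathbf{y} \in \mathbb{C}^n$

$$\min_{f\in\mathcal{I}_{\mathbf{x}}(\mathbf{y})} \|f\|_{\mathcal{B}} \ge \frac{1}{\beta_n} \min_{f\in\mathcal{I}_{\mathbf{x}}(\mathbf{y})\cap\mathcal{S}^{\mathbf{x}}} \|f\|_{\mathcal{B}} \tag{6.7}$$

then the relaxed linear representer theorem (6.4) holds true for any continuous loss function $V$ and any regularization parameter $\mu$.

Proof. Suppose that (6.7) is satisfied. Let $f_0$ be a minimizer of

$$\min_{f\in\mathcal{B}} V(f(\mathbf{x})) + \mu\beta_n\|f\|_{\mathcal{B}}.$$

Choose $g$ to be a function in $\mathcal{S}^{\mathbf{x}}$ that interpolates $f_0$ at $\mathbf{x}$, namely, $g(\mathbf{x}) = f_0(\mathbf{x})$. By (6.7),

$$\|g\|_{\mathcal{B}} \le \beta_n\|f_0\|_{\mathcal{B}},$$

which yields

$$V(g(\mathbf{x})) + \mu\|g\|_{\mathcal{B}} \le V(f_0(\mathbf{x})) + \mu\beta_n\|f_0\|_{\mathcal{B}}.$$

The proof is hence complete. □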



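We next give a characterization of (6.7), which gives rise to a relaxation of the admissible condition (A4) and leads to the relaxed linear representer theorem (6.4).

Theorem 6.2. Equation (6.7) holds true for all $\mathbf{y} \in \mathbb{C}^n$ if and only if

$$\|(K[\mathbf{x}])^{-1}K_{\mathbf{x}}(t)\|_{\ell^1(\mathbb{N}_n)} \le \beta_n \quad \text{for all } t \in X. \tag{6.8}$$

Proof. The set $\mathcal{I}_{\mathbf{x}}(\mathbf{y}) \cap \mathcal{S}^{\mathbf{x}}$ consists of only one function $f_0 := K^{\mathbf{x}}(\cdot)K[\mathbf{x}]^{-1}\mathbf{y}$. Let $g$ be an arbitrary function in $\mathcal{I}_{\mathbf{x}}(\mathbf{y}) \cap \mathcal{B}_0$. By adding sampling points and assigning the corresponding coefficients to be zero if necessary, we may assume $g \in \mathcal{S}^{\mathbf{x}\cup\mathbf{t}} \cap \mathcal{I}_{\mathbf{x}}(\mathbf{y})$ for some $\mathbf{t} := \{t_j \in X : j \in \mathbb{N}_m\}$ disjoint with $\mathbf{x}$. Let $\mathbf{b} := g(\mathbf{t})$, and denote by $K[\mathbf{t},\mathbf{x}]$ and $K[\mathbf{x},\mathbf{t}]$ the $n \times m$ and $m \times n$ matrices given by

$$(K[\mathbf{t},\mathbf{x}])_{jk} := K(t_k, x_j),\ j \in \mathbb{N}_n,\ k \in \mathbb{N}_m, \qquad (K[\mathbf{x},\mathbf{t}])_{jk} := K(x_k, t_j),\ j \in \mathbb{N}_m,\ k \in \mathbb{N}_n.$$

Then

$$\|g\|_{\mathcal{B}} = \left\| \begin{pmatrix} K[\mathbf{x}] & K[\mathbf{t},\mathbf{x}] \\ K[\mathbf{x},\mathbf{t}] & K[\mathbf{t}] \end{pmatrix}^{-1} \begin{pmatrix} \mathbf{y} \\ \mathbf{b} \end{pmatrix} \right\|_{\ell^1(\mathbb{N}_{n+m})} = \|K[\mathbf{x}]^{-1}\mathbf{y} - K[\mathbf{x}]^{-1}K[\mathbf{t},\mathbf{x}]\tilde{\mathbf{b}}\|_{\ell^1(\mathbb{N}_n)} + \|\tilde{\mathbf{b}}\|_{\ell^1(\mathbb{N}_m)}, \tag{6.9}$$

where

$$\tilde{\mathbf{b}} := \left(K[\mathbf{t}] - K[\mathbf{x},\mathbf{t}]K[\mathbf{x}]^{-1}K[\mathbf{t},\mathbf{x}]\right)^{-1}\left(\mathbf{b} - K[\mathbf{x},\mathbf{t}]K[\mathbf{x}]^{-1}\mathbf{y}\right).$$

Note that as $\mathbf{b}$ is allowed to equal any vector in $\mathbb{C}^m$, so is $\tilde{\mathbf{b}}$.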

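If (6.7) holds true for all $\mathbf{y} \in \mathbb{C}^n$ then we choose $\mathbf{t}$ to be a singleton $\{t\}$, $\tilde{b} = 1$, and $\mathbf{y} = K[t,\mathbf{x}] = K_{\mathbf{x}}(t)$. For this choice (6.9) gives $\|g\|_{\mathcal{B}} = 1$, and hence

$$\|K[\mathbf{x}]^{-1}K_{\mathbf{x}}(t)\|_{\ell^1(\mathbb{N}_n)} = \|f_0\|_{\mathcal{B}} \le \beta_n\|g\|_{\mathcal{B}} = \beta_n,$$

which is (6.8). Conversely, suppose that (6.8) is satisfied. We need to show that for all $g \in \mathcal{I}_{\mathbf{x}}(\mathbf{y})$

$$\|g\|_{\mathcal{B}} \ge \frac{1}{\beta_n}\|f_0\|_{\mathcal{B}} = \frac{1}{\beta_n}\|K[\mathbf{x}]^{-1}\mathbf{y}\|_{\ell^1(\mathbb{N}_n)}.$$

We shall discuss the case when $g \in \mathcal{I}_{\mathbf{x}}(\mathbf{y}) \cap \mathcal{B}_0$ only as the general case will then follow by the same arguments as those in the last paragraph of the proof of Theorem 4.8. Let $g \in \mathcal{I}_{\mathbf{x}}(\mathbf{y}) \cap \mathcal{B}_0$ have the norm (6.9). Clearly,

$$\|g\|_{\mathcal{B}} \ge \|\tilde{\mathbf{b}}\|_{\ell^1(\mathbb{N}_m)} \ge \frac{1}{\beta_n}\|K[\mathbf{x}]^{-1}\mathbf{y}\|_{\ell^1(\mathbb{N}_n)}$$

if $\|K[\mathbf{x}]^{-1}\mathbf{y}\|_{\ell^1(\mathbb{N}_n)} \le \beta_n\|\tilde{\mathbf{b}}\|_{\ell^1(\mathbb{N}_m)}$. When $\|K[\mathbf{x}]^{-1}\mathbf{y}\|_{\ell^1(\mathbb{N}_n)} > \beta_n\|\tilde{\mathbf{b}}\|_{\ell^1(\mathbb{N}_m)}$, we have

$$\begin{aligned}
\|g\|_{\mathcal{B}} &\ge \|K[\mathbf{x}]^{-1}\mathbf{y}\|_{\ell^1(\mathbb{N}_n)} - \|K[\mathbf{x}]^{-1}K[\mathbf{t},\mathbf{x}]\tilde{\mathbf{b}}\|_{\ell^1(\mathbb{N}_n)} + \|\tilde{\mathbf{b}}\|_{\ell^1(\mathbb{N}_m)} \\
&\ge \|K[\mathbf{x}]^{-1}\mathbf{y}\|_{\ell^1(\mathbb{N}_n)} - \Big(\max_{k\in\mathbb{N}_m} \|K[\mathbf{x}]^{-1}K_{\mathbf{x}}(t_k)\|_{\ell^1(\mathbb{N}_n)}\Big)\|\tilde{\mathbf{b}}\|_{\ell^1(\mathbb{N}_m)} + \|\tilde{\mathbf{b}}\|_{\ell^1(\mathbb{N}_m)} \\
&\ge \|K[\mathbf{x}]^{-1}\mathbf{y}\|_{\ell^1(\mathbb{N}_n)} - (\beta_n - 1)\|\tilde{\mathbf{b}}\|_{\ell^1(\mathbb{N}_m)} \ge \|K[\mathbf{x}]^{-1}\mathbf{y}\|_{\ell^1(\mathbb{N}_n)} - (\beta_n - 1)\frac{1}{\beta_n}\|K[\mathbf{x}]^{-1}\mathbf{y}\|_{\ell^1(\mathbb{N}_n)} \\
&= \frac{1}{\beta_n}\|K[\mathbf{x}]^{-1}\mathbf{y}\|_{\ell^1(\mathbb{N}_n)},
\end{aligned}$$

which completes the proof. □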
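
The sufficiency argument can be illustrated numerically. The sketch below assumes a unit-width Gaussian kernel (chosen only because it violates (A4)), a few illustrative extra centers $\mathbf{t}$, and a bound `beta` computed from those centers alone rather than from a supremum over all of $X$; the helper `gram` is hypothetical.

```python
import numpy as np

def gaussian(s, t):
    # Assumed unit-width Gaussian kernel, used here only as a test case.
    return np.exp(-(s - t) ** 2)

def gram(kernel, cols, rows):
    """Matrix with entry kernel(cols[k], rows[j]) in row j, column k."""
    return kernel(cols[None, :], rows[:, None])

x = np.array([0.0, 1.0, 2.0])          # sampling points
t = np.array([0.5, 1.5, 2.7])          # extra centers, disjoint from x
Kx = gram(gaussian, x, x)              # K[x]
Ktx = gram(gaussian, t, x)             # K[t, x]; column k is K_x(t_k)

# Largest l^1 norm of K[x]^{-1} K_x(t_k) over the extra centers, floored at 1.
beta = max(1.0, np.abs(np.linalg.solve(Kx, Ktx)).sum(axis=0).max())

rng = np.random.default_rng(1)
a, b = rng.standard_normal(3), rng.standard_normal(3)
g_norm = np.abs(a).sum() + np.abs(b).sum()        # ||g||_B for g with centers x and t
y = Kx @ a + Ktx @ b                              # y = g(x)
f0_norm = np.abs(np.linalg.solve(Kx, y)).sum()    # ||f_0||_B for the interpolant in S^x
```

For any coefficients `a` and `b`, the interpolant in $\mathcal{S}^{\mathbf{x}}$ should satisfy `f0_norm <= beta * g_norm`, which is the inequality (6.7) restricted to functions with centers in $\mathbf{x} \cup \mathbf{t}$.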



The above result together with the discussion of the application of Proposition 6.1 to regularized learning provides a relaxation of the requirement (A4). The quantity $\sup_{t\in X} \|K[\mathbf{x}]^{-1}K_{\mathbf{x}}(t)\|_{\ell^1(\mathbb{N}_n)}$ is the Lebesgue constant of the kernel interpolation. Asking it to be exactly bounded by 1 is indeed demanding. Recent numerical experiments [^] and analysis [12] indicate that for many kernels, this Lebesgue constant could be uniformly bounded. In this case, the $\ell^1$-regularized learning in $\mathcal{B}$ performs well by (6.5). Furthermore, as long as $\beta_n$ does not increase to infinity too fast, the learning scheme can still work well by (6.6). Specifically, it was proved in [12] that the Lebesgue constant for the reproducing kernel of the Sobolev space on a compact domain is uniformly bounded for quasi-uniform input points (see Theorem 4.6 therein). Another example is given in Q for translation invariant kernels $K(x,y) = \phi(x-y)$, $x,y \in \mathbb{R}^d$. It was shown there that as long as

$$c_1(1 + \|\xi\|_2^2)^{-\tau} \le \hat{\phi}(\xi) \le c_2(1 + \|\xi\|_2^2)^{-\tau}, \quad \|\xi\|_2 > M \tag{6.10}$$

for some positive constants $c_1, c_2, M$ and $\tau$, the Lebesgue constant for quasi-uniform inputs is bounded by a multiple of $\sqrt{n}$. Commonly used kernels satisfying (6.10) include Poisson radial functions [l^, Matérn kernels and Wendland's compactly supported kernels [28]. Finally, we remark from numerical experiments that the following kernels [20]

$$\exp\left(-\|x-y\|_p^{\gamma}\right), \quad x,y \in \mathbb{R}^d,\ \gamma \in (0,1),\ p = 1,2$$

seem to satisfy (A4) for small enough $\gamma$ and moderate $n$. We shall leave the search for more kernels satisfying (A4) and its relaxation (6.8) as an open question for future study.

7 Numerical Experiments

We end this paper with a numerical experiment to show that the regularization algorithm (4.1) is indeed able to yield sparse learning compared to the classical regularization network in machine learning.

We shall use the exponential kernel $K$ (5.1). Let $\mathcal{B}$ be the corresponding RKBS with the $\ell^1$ norm constructed by (1.3) and let $\mathcal{H}_K$ be the RKHS of $K$. We restrict ourselves to the field of real numbers and use the square loss function $V(f(\mathbf{x})) := \|f(\mathbf{x}) - \mathbf{y}\|_2^2$. We shall compare the two models

$$\min_{f\in\mathcal{B}} \|f(\mathbf{x}) - \mathbf{y}\|_2^2 + \mu\|f\|_{\mathcal{B}} \quad \text{and} \quad \min_{g\in\mathcal{H}_K} \|g(\mathbf{x}) - \mathbf{y}\|_2^2 + \mu\|g\|_{\mathcal{H}_K}^2.$$

Both of them satisfy the linear representer theorem. Specifically, the minimizers $f_0$ and $g_0$ of the above two models are respectively given by

$$f_0 = K^{\mathbf{x}}(\cdot)\mathbf{b} \quad \text{with} \quad \mathbf{b} := \arg\min_{\mathbf{c}\in\mathbb{R}^n} \{\|K[\mathbf{x}]\mathbf{c} - \mathbf{y}\|_2^2 + \mu\|\mathbf{c}\|_{\ell^1(\mathbb{N}_n)}\}$$

and

$$g_0 = K^{\mathbf{x}}(\cdot)\mathbf{h} \quad \text{with} \quad \mathbf{h} := \arg\min_{\mathbf{c}\in\mathbb{R}^n} \{\|K[\mathbf{x}]\mathbf{c} - \mathbf{y}\|_2^2 + \mu\,\mathbf{c}^T K[\mathbf{x}]\mathbf{c}\}.$$

We point out that the above $\ell^1$ minimization problem about $\mathbf{b}$ does not have a closed form solution. There are numerous methods proposed to solve this problem and here we employ the proximity algorithm recently developed in [^]. The closed form of the minimizer $\mathbf{h}$ is well known to be $(K[\mathbf{x}] + \mu I_n)^{-1}\mathbf{y}$. Here $I_n$ denotes the $n \times n$ identity matrix.
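
The two coefficient problems can be sketched in a few lines. The code below is not the proximity algorithm cited above; it substitutes plain proximal gradient descent (ISTA) with soft thresholding for the $\ell^1$ problem, and all function names, the number of points and the value of $\mu$ are illustrative assumptions.

```python
import numpy as np

def exp_kernel_matrix(pts):
    # K[x] for the exponential kernel; symmetric, so index order is immaterial.
    return np.exp(-np.abs(pts[:, None] - pts[None, :]))

def ridge_coefficients(K, y, mu):
    """Closed form h = (K[x] + mu I_n)^{-1} y of the RKHS model."""
    return np.linalg.solve(K + mu * np.eye(len(y)), y)

def soft_threshold(v, tau):
    # Proximity operator of tau * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista_l1(K, y, mu, iters=2000):
    """Proximal gradient (ISTA) for min_c ||K c - y||_2^2 + mu ||c||_1."""
    step = 1.0 / (2.0 * np.linalg.norm(K, 2) ** 2)   # 1 / Lipschitz constant of the gradient
    c = np.zeros_like(y)
    for _ in range(iters):
        grad = 2.0 * K.T @ (K @ c - y)
        c = soft_threshold(c - step * grad, step * mu)
    return c

def l1_objective(K, y, mu, c):
    return np.sum((K @ c - y) ** 2) + mu * np.abs(c).sum()

pts = np.linspace(-1.0, 1.0, 20)       # a small illustrative grid, not the 200 points of the paper
K = exp_kernel_matrix(pts)
y = np.exp(-np.abs(pts))               # noiseless samples of a single kernel function
mu = 0.1

h = ridge_coefficients(K, y, mu)       # RKHS coefficients
b = ista_l1(K, y, mu)                  # approximate RKBS coefficients
```

With the step size set to the reciprocal of the gradient's Lipschitz constant, ISTA does not increase the objective from one iteration to the next, so `b` is at least as good as the zero vector. How many entries of `b` end up exactly zero depends on `mu` and on the number of iterations.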

For both models, $\mathbf{x}$ is set to be 200 equally spaced points in $[-1,1]$ and the output vector $\mathbf{y}$ is chosen to be the evaluation of the target function

$$e^{-|x+1|} + e^{-|x+0.5|} + e^{-|x|} + e^{-|x-0.5|} + e^{-|x-1|}, \quad x \in [-1,1]$$

at $\mathbf{x}$ and then disturbed by some noise. Also, the regularization parameter $\mu$ for each model will be optimally chosen from $\{10^j : j = -7, -6, \ldots, 1\}$ so that the distance between the learned function and the target function in $L^2([-1,1])$ will be minimized. We then compare the approximation accuracy measured by this error and the sparsity for these two models. The sparsity is measured by the number of nonzero components in the coefficient vectors $\mathbf{b}$ and $\mathbf{h}$.





          Gaussian noise              Uniform noise               Pepper sauce noise
          Error     Sparsity (Max)    Error     Sparsity (Max)    Error     Sparsity (Max)
RKHS      2.1E-3    200 (200)         7.9E-4    200 (200)         9.4E-4    200 (200)
RKBS      1.0E-3    13.4 (17)         3.6E-4    14.7 (25)         4.5E-4    14.5 (23)

Table 1: Comparison of the least square regularization in RKHS and in RKBS with the $\ell^1$ norm for the exponential kernel.


We test both models with three types of noise: Gaussian noise with variance 0.01, uniform noise in $[-0.1, 0.1]$ and some random pepper sauce noise in $\{-0.1, 0.1\}$. For each type of noise, we run the numerical experiment 50 times and compute the average approximation error, the average sparsity, and the maximum sparsity over the 50 runs. The results are tabulated above.